Abstract

Connectionist  learning  models  have  had  considerable  empirical  success,  but  it  is 
hard  to  characterize  exactly  what  they  learn.  The  learning  of  finite-state  languages 
(FSL)  from  example  strings  is  a  domain  which  has  been  extensively  studied  and 
might  provide  an  opportunity  to  help  understand  connectionist  learning.  A  major 
problem  is  that  traditional  FSL  learning  assumes  the  storage  of  all  examples  and 
thus  violates  connectionist  principles.  This  paper  presents  a  provably  correct 
algorithm for inferring any minimum-state deterministic finite-state automaton
(FSA)  from  a  complete  ordered  sample  using  limited  total  storage  and  without 
storing  example  strings.  The  algorithm  is  an  iterative  strategy  that  uses  at  each 
stage  a  current  encoding  of  the  data  considered  so  far,  and  one  single  sample  string. 
One  of  the  crucial  advantages  of  our  algorithm  is  that  the  total  amount  of  space,  used 
in  the  course  of  learning,  for  encoding  any  finite  prefix  of  the  sample  is  polynomial  in 
the  size  of  the  inferred  minimum  state  deterministic  FSA.  The  algorithm  is  also 
relatively  efficient  in  time  and  has  been  implemented.  More  importantly,  there  is  a 
connectionist  version  of  the  algorithm  that  preserves  these  properties.  The 
connectionist  version  requires  much  more  structure  than  the  usual  models  and  has 
not  yet  been  implemented.  But  it  does  significantly  extend  the  scope  of  connectionist 
learning  systems  and  helps  relate  them  to  other  paradigms.  We  also  show  that  no 
machine  with  finite  working  storage  can  identify  iteratively  the  FSL  from  arbitrary 
presentations. 

1.  Introduction

The  ability  to  adapt  and  learn  has  always  been  considered  the  hallmark  of 
intelligence,  but  machine  learning  has  proved  to  be  very  difficult  to  study.  There  is 
currently  a  renewed  interest  in  learning  in  the  theoretical  computer  science 
community [Va 84, Va 85, KLPV 87, Na 87, RS1 87, RS2 87] and a largely separate,
explosive growth in the study of learning in connectionist networks [Hi 87]. One
purpose  of  this  paper  is  to  establish  some  connections  (sic)  between  these  two 
research  programs. 

The  setting  for  this  paper  is  the  abstract  problem  of  inferring  Finite  State 
Automata  (FSA)  from  sample  input  strings,  labelled  as  +  or  -  depending  on  whether 
they  are  to  be  accepted  or  rejected  by  the  resulting  FSA.  This  problem  has  a  long 
history  in  theoretical  learning  studies  [An  76,  An  81,  An  87]  and  can  be  easily 
mapped  to  common  connectionist  situations.  There  are  arguments  [Br  87]  that 
interacting  FSA  constitute  a  natural  substrate  for  intelligent  systems,  but  that  issue 
is  beyond  the  scope  of  this  paper. 

We  will  start  with  a  very  simple  sample  problem.  Suppose  we  would  like  a 
learning  machine  to  compute  an  FSA  that  will  accept  exactly  those  strings  over  the 
alphabet {a, b} that contain an even number of a's. One minimal answer would be the
following  two-state  FSA. 


Figure 1: A parity FSA

We  adopt  the  convention  that  states  drawn  with  one  circle  are  rejecting  states 
and  those  drawn  with  a  double  circle  are  accepting.  The  FSA  always  starts  in  state 
q0, which is accepting iff the empty string λ is to be accepted. We will present in
Section 3 an algorithm that will always learn the minimum state deterministic FSA
for any finite state language which is presented to the learning algorithm in strict
lexicographic order. There are a number of issues concerning this algorithm, its
proof  and  its  complexity  analysis  that  are  independent  of  any  relation  to  parallel  and 
connectionist  computation. 
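
To make the running example concrete, the following is a minimal Python sketch of a two-state parity FSA. The figure itself is not reproduced here, so the transition table is an assumption based on the description above (the letter a toggles the state, the letter b leaves it unchanged); the names DELTA, ACCEPTING and accepts are illustrative only.

```python
# Assumed transition table for the two-state "even number of a's" FSA.
# q0 is the start state and the only accepting state.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q0", ("q1", "b"): "q1",
}
ACCEPTING = {"q0"}


def accepts(string, start="q0"):
    """Run the deterministic FSA on `string` and report acceptance."""
    state = start
    for letter in string:
        state = DELTA[(state, letter)]
    return state in ACCEPTING
```

With this table the empty string and "abab" are accepted, while "a" and "bab" are rejected.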

It  turns  out  that  the  "even  a’s”  language  is  the  same  as  the  well-studied  "parity 
problem”  in  connectionist  learning  [Hi  87].  The  goal  there  is  to  train  a  network  of 
simple units to accept exactly binary strings with an even number of 1's. In the
usual  connectionist  situation,  the  entire  string  (of  fixed  length)  is  presented  to  a 
bottom  layer  of  units  and  the  answer  read  from  a  pair  of  decision  units  that  comprise 
the  top  layer.  There  are  also  intermediate  (hidden)  units  and  it  is  the  weights  on 
connections  among  all  the  units  which  the  connectionist  network  modifies  in 
learning. 

The  parity  problem  is  very  difficult  for  existing  connectionist  learning  networks 
and  it  is  instructive  to  see  why  this  is  so.  The  basic  reason  is  that  the  parity  of  a 
string  is  a  strictly  global  property  and  that  standard  connectionist  learning 
techniques  use  only  local  weight-change  rules.  Even  when  a  network  can  be  made  to 
do  a  fairly  good  job  on  a  fixed-length  parity  problem,  it  totally  fails  to  generalize  to 
shorter  strings.  Of  course,  people  are  also  unable  to  compute  the  parity  of  a  long 
binary  string  in  parallel.  What  we  do  in  this  situation  is  much  more  like  the  FSA  of 
Figure  1.  So  one  question  concerns  the  feasibility  of  connectionist  FSA  systems. 

There  are  many  ways  to  make  a  connectionist  version  of  an  FSA  like  that  of 
Figure  1.  One  of  the  simplest  assigns  a  connectionist  unit  to  each  state  and  to  the 
answer units + and -. It is convenient to add an explicit termination symbol ⊢ and
to  use  conjunctive  connections  [FB  82]  to  capture  transitions.  The  "current  input 
letter”  is  captured  as  the  activity  of  exactly  one  of  the  top  three  units.  Figure  2  is  the 
equivalent  of  Figure  1  under  this  transformation. 

Thus unit 0 corresponds to the accepting state q0 in Figure 1 because when it is
active and the input symbol is ⊢, the answer + is activated. Similarly, activity in
unit  1  and  in  the  unit  for  a  leads  to  activity  in  unit  0  for  the  next  time  step.  Note  that 
activity  is  allowed  in  only  one  of  the  units  0,  1,  +,  -  for  each  step  of  the 
(synchronous)  simulation.  In  Section  5,  we  will  show  how  the  construction  of  Section 
3  can  be  transformed  into  one  which  has  a  connectionist  system  learn  to  produce 
subnets like that of Figure 2. There have been some attempts [Wi 87] to extend
conventional connectionist learning techniques to sequences, but our approach is
quite different. It would be interesting to compare the various techniques.

Figure 2: A connectionist parity network
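
The unit-per-state construction can be sketched as a small synchronous simulation. This is only an illustration under stated assumptions: the figure is not reproduced here, so the table of conjunctive connections is inferred from the two connections the text spells out (unit 0 with the termination symbol activates +, unit 1 with a activates unit 0) plus the parity FSA; the names UNITS, END, CONJUNCTIVE, step and run are hypothetical.

```python
UNITS = ["0", "1", "+", "-"]
END = "|-"  # stand-in for the explicit termination symbol

# Each conjunctive connection fires its target when BOTH the source unit
# and the named input unit are active at the current time step.
CONJUNCTIVE = {
    ("0", "a"): "1", ("0", "b"): "0", ("0", END): "+",
    ("1", "a"): "0", ("1", "b"): "1", ("1", END): "-",
}


def step(activity, letter):
    """One synchronous update of all units for one input letter."""
    nxt = {u: 0 for u in UNITS}
    for (src, inp), dst in CONJUNCTIVE.items():
        if activity[src] and inp == letter:
            nxt[dst] = 1
    return nxt


def run(string):
    """Feed the letters of `string`, then END, and read the answer unit."""
    activity = {u: int(u == "0") for u in UNITS}  # unit 0 starts active
    for letter in list(string) + [END]:
        activity = step(activity, letter)
        # Only one of the units 0, 1, +, - is active at each step.
        assert sum(activity.values()) == 1
    return "+" if activity["+"] else "-"
```

For example, run("aba") yields "+" and run("abb") yields "-", matching the parity FSA.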

More  generally,  we  are  interested  in  the  range  of  applicability  of  various  learning 
techniques and in how theoretical results can contribute to the development of
learning  machines.  The  starting  point  for  the  current  investigation  was  the 
application  of  the  theory  of  learning  FSA  to  connectionist  systems.  As  always,  the 
assumptions  in  the  two  cases  were  quite  different  and  had  to  be  reconciled.  There  is, 
as yet, no precise specification of what constitutes a "connectionist” system, but
there  are  a  number  of  generally  accepted  criteria.  The  truism  that  any  machine  can 
be  built  from  linear  threshold  elements  is  massively  irrelevant.  Connectionist 
architectures  are  characterized  by  highly  parallel  configurations  of  simple 
processors  exchanging  very  simple  messages.  Any  system  having  a  small  number  of 
control  streams,  an  interpreter  or  large  amounts  of  passive  storage  is  strongly  anti- 
connectionist  in  spirit.  It  is  this  last  characteristic  that  eliminated  almost  all  the 
existing  formal  learning  models  as  the  basis  for  our  study.  Most  work  has  assumed 
that  the  learning  device  can  store  all  of  the  samples  it  has  seen,  and  base  its  next 
guess  on  all  this  data.  There  have  been  a  few  studies  on  "iterative”  learning  where 
the  guessing  device  can  store  only  its  last  guess  and  the  current  sample  [Wi  76,  JB 
81,  OSW  86].  Some  of  the  techniques  from  [JB  81]  have  been  adapted  to  prove  a 
negative result in Section 4. We show that a learning device using any finite amount
of auxiliary memory cannot learn the Finite State Languages (FSL) from unordered
presentations.

Another important requirement for a model to be connectionist is that it adapt.
That  is,  a  connectionist  system  should  reflect  its  learning  directly  in  the  structure  of 
the  network.  This  is  usually  achieved  by  changing  the  weights  on  connections 
between  processing  elements.  One  also  usually  requires  that  the  learning  rule  be 
local;  a  homunculus  with  a  wire-wrap  gun  is  decidedly  unconnectionist.  All  of  these 
criteria  are  based  on  abstractions  of  biological  information  processing  and  all  were 
important  in  the  development  of  this  paper.  The  algorithm  and  proof  of  Section  3  do 
not  mention  them  explicitly,  but  the  results  arose  from  these  considerations.  After  a 
pedagogical transition in Section 5a, Section 5b presents the outline of an FSL learner
that  is  close  to  the  connectionist  spirit.  Error  tolerance,  another  connectionist  canon, 
is  only  touched  upon  briefly  but  appears  to  present  no  fundamental  difficulties. 

In  a  general  way,  the  current  guess  of  any  learning  algorithm  is  an  approximate 
encapsulation  of  the  data  presented  to  it.  Most  connectionist  paradigms  and  some 
others  [Va  84,  Ho  69]  assume  that  the  learner  gets  to  see  the  same  data  repeatedly 
and  to  refine  its  guesses.  It  is  not  surprising  that  this  can  often  be  shown  to 
substitute,  in  the  long  run,  for  storing  the  data.  As  mentioned  above,  we  show  in 
Section  4  that  in  general  an  algorithm  with  limited  storage  will  not  be  able  to  learn 
(even)  FSA  on  a  single  pass  through  the  data.  But  there  is  a  special  case  in  which 
one  pass  does  suffice  and  that  is  the  one  we  consider  in  Section  3. 

The  restriction  that  makes  possible  FSA  learning  in  a  single  pass  is  that  the 
learning  algorithm  be  presented  with  the  data  in  strict  lexicographic  order,  that  is, 
±λ, ±a, ±b, ±aa, ... . In this case the learner can construct an FSA, referred to also
as the current guess, that exactly captures the sample seen so far. The FSA is
nondeterministic, but consistent: every path through the FSA gives the same result for
every sampled string considered so far. It turns out that this is a minimal state FSA
consistent with the data and can thus be viewed as the best guess to date. The idea of
looking at strict lexicographic orders came to us in considering the algorithm of
Rivest and Schapire [RS1 87]. Their procedure is equivalent to receiving ± samples
in  strict  order. 
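
The strict lexicographic presentation (shorter strings first, alphabetical order within each length) can be generated without storing any earlier strings. The sketch below is illustrative only; the function names succ and ordered_sample, and the use of a Python predicate as the source of labels, are assumptions made for the example.

```python
def succ(x, alphabet="ab"):
    """Return the string that follows x in strict lexicographic order."""
    letters = list(x)
    i = len(letters) - 1
    # Roll over trailing occurrences of the last letter, like an odometer.
    while i >= 0 and letters[i] == alphabet[-1]:
        letters[i] = alphabet[0]
        i -= 1
    if i < 0:
        # Every position rolled over: move on to the next length.
        return alphabet[0] * (len(x) + 1)
    letters[i] = alphabet[alphabet.index(letters[i]) + 1]
    return "".join(letters)


def ordered_sample(member, n, alphabet="ab"):
    """First n labelled examples; the empty string is printed as λ."""
    x, out = "", []
    for _ in range(n):
        out.append(("+" if member(x) else "-") + (x or "λ"))
        x = succ(x, alphabet)
    return out
```

For the "even number of a's" language, ordered_sample(lambda s: s.count("a") % 2 == 0, 5) gives +λ, -a, +b, +aa, -ab.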

Since  the  sample  is  presented  in  lexicographic  order,  our  learning  algorithm  will 
be  able  to  build  up  its  guesses  in  a  cumulative  way.  If  the  empty  string  is  (is  not)  in 
the  inferred  language  L,  then  the  first  guess  is  a  machine  with  one  accepting 
(rejecting)  state.  Each  subsequent  example  is  either  consistent  with  the  current 
guess, or leads to a new guess. The details of this comprise the learning algorithm of
Section 3. When a new state is added to the current guess, a set of incoming and
outgoing links to and from this new state are added. Consider the "even a's"
language. With the sample +λ, the initial accepting state q0 has links to itself
under every letter. These links are all mutable and may later be deleted. When -a is
presented, the self-looping link under a is deleted and replaced by a permanent link
to a new rejecting state q1. We further add a mutable link from q0 to q1 under b, and
the whole set of links from q1. Figure 3 shows the guess for the "even a's" language
after the initial sample +λ and -a. The link from q0 to q1 under b is pruned when
+b is presented. +aa will imply the deletion of the current self-link of q1 under a,
and -ab will finally change the guess to that of Figure 1.

Figure 3: An intermediate guess for the parity FSA

The  remainder  of  the  paper  is  divided  into  three  major  sections.  Section  3 
considers  the  general  problem  of  learning  FSA  from  lexicographically  ordered 
strings.  An  algorithm  is  presented  and  its  space  and  time  complexity  are  analyzed. 
The  proof  of  correctness  for  this  algorithm  in  Section  3c  uses  techniques  from 
verification  theory  that  have  apparently  not  been  used  in  the  learning  literature.  In 
Section  4  we  show  that  the  strong  assumption  of  lexicographic  order  is  necessary  - 
no  machine  with  finite  storage  can  learn  the  FSA  from  arbitrary  samples.  Section  5 
undertakes  the  translation  to  the  connectionist  framework.  This  is  done  in  two 
steps.  First  a  distributed  and  modular,  but  still  conventional  version,  is  described. 
Then  a  transformation  of  this  system  to  a  connectionist  network  is  outlined.  Some 
general  conclusions  complete  the  paper. 

2.  Relation to previous work

The  survey  by  Angluin  and  Smith  [AS  83]  is  the  best  overall  introduction  to 
formal  learning  theory;  we  just  note  some  of  the  most  relevant  work.  Our  learning 
algorithm  (to  be  presented  in  the  next  section)  identifies  the  minimum  state 
deterministic  FSA  (DFSA)  for  any  FSL  in  the  limit:  Eventually  the  guess  will  be  the 
minimum  state  DFSA,  but  the  learner  has  no  way  of  knowing  when  this  guess  is 
found.  The  learner  begins  with  no  a  priori  knowledge.  We  can  regard  the  sample 
data  as  coming  from  an  unknown  resettable  machine  that  identifies  the  inferred  FSL. 
As  stated  above,  our  algorithm  is  an  iterative  strategy  that  uses  at  each  stage  a 
current  encoding  of  the  data  considered  so  far,  and  the  single  current  sample  string. 
One  of  the  crucial  advantages  of  our  algorithm  is  that  the  total  amount  of  space  used 
in  the  course  of  learning,  for  encoding  any  finite  prefix  of  the  sample,  is  polynomial 
in  the  size  of  the  inferred  minimum-state  DFSA.  Any  encoding  of  a  target  grammar 
requires O(n²) space, and our algorithm is O(n²).

Iterative  learning  strategies  have  been  studied  in  [Wi  76,  JB  81,  OSW  86]. 
Jantke  and  Beick  [JB  81]  prove  that  there  is  a  set  of  functions  that  can  be  identified 
in  the  limit  by  an  iterative  strategy,  using  the  strict  lexicographic  presentations  of 
the  functions,  but  this  set  can  not  be  identified  in  the  limit  by  an  iterative  strategy 
using  arbitrary  presentations.  The  proof  can  be  slightly  modified  in  order  to  prove 
that  there  is  no  iterative  algorithm  that  can  identify  the  FSL  in  the  limit,  using 
arbitrary presentations for the languages. We generalize the definition of an
iterative  device  to  capture  the  ability  to  use  any  finite  auxiliary  memory  in  the 
course  of  learning.  Hence,  our  result  is  stronger  than  that  in  [JB  81]. 

Gold  [Go  67]  gives  algorithms  for  identifying  FSA  in  the  limit  both  for  resettable 
and  nonresettable  machines.  These  algorithms  identify  by  means  of  enumeration. 
Each  experiment  is  performed  in  succession,  and  in  each  stage  all  the  experiments 
performed  so  far  are  used  in  order  to  construct  the  next  guess.  Consequently,  the 
storage  needed  until  the  correct  guess  is  reached  is  exponential  in  the  size  of  the 
minimum  state  DFSA.  The  enumeration  algorithm  for  resettable  machines  has  the 
advantage  (over  our  algorithm)  that  it  does  not  specify  the  experiments  to  be 
performed;  it  can  use  any  data  that  identifies  the  inferred  FSL.  This  property  is  not 
preserved  when  the  machines  are  nonresettable. 

Gold [Go 72] introduces another learning technique for identifying a minimum
state DFSA in the limit by experimenting with a resettable machine. This variation,
called the state characterization method, is much simpler computationally.
This technique specifies the experiments to be performed, and again has the
disadvantage of having to monitor an infinitely increasing storage area.

Angluin  [An  87]  bases  her  result  upon  the  method  of  state  characterization,  and 
shows how to infer the minimum state DFSA by experimenting with the unknown
automaton (asking membership queries), and using an oracle that provides
counterexamples  to  incorrect  guesses.  Using  this  additional  information  Angluin 
provides  an  algorithm  that  learns  in  time  polynomial  in  the  maximum  length  of  any 
counterexample  provided  by  the  oracle,  and  the  number  of  states  in  the  minimum- 
state  DFSA.  This  algorithm  is  comparable  to  ours  in  the  sense  that  it  uses 
experiments  that  are  chosen  at  will. 

Recently Rivest and Schapire [RS1 87, RS2 87] presented a new approach to the
problem  of  learning  in  the  limit  by  experimenting  with  a  nonresettable  FSA.  They 
introduce  the  notion  of  diversity  which  is  the  number  of  equivalence  classes  of  tests 
(basically,  an  experiment  from  any  possible  state  of  the  inferred  machine).  The 
learning  algorithm  uses  a  powerful  oracle  for  determining  the  equivalence  between 
tests,  and  finds  the  correct  DFSA  in  time  polynomial  in  the  diversity.  Since  the 
lower bound on the diversity is the logarithm of the number of states, and it is the best possible,
this  algorithm  is  practically  interesting.  Again,  the  experiments  in  this  algorithm 
are  chosen  at  will,  and  in  fact  they  are  a  finite  prefix  of  a  lexicographically  ordered 
sample  of  the  inferred  language. 

Another  variation  of  automaton  identification  is  that  from  a  given  finite  subset  of 
the input-output behavior. Biermann and Feldman [BF 72] discuss this approach.
The learning strategy there includes an adjustable parameter for inferring DFSAs
with varying degrees of accuracy, accomplished by algorithms with varying
complexities. In general, Gold [Go 78] and Angluin [An 78] prove that finding a
DFSA of n states or less that is compatible with given data is NP-complete. On the
other hand, Trakhtenbrot and Barzdin [TB 73] and Angluin [An 76] show that if the
sample is uniform-complete, i.e. consists of all strings not exceeding a given length
and no others, then there is a polynomial time algorithm (in the size of the whole
sample) that finds the minimum state DFSA that is compatible with it. Note that the
sample size is exponential in the size of the longest string in the sample. We can
regard  our  algorithm  as  an  alternative  method  for  identifying  the  minimum  state 
DFSA  from  a  given  uniform-complete  sample.  As  stated  above,  our  algorithm  is 
much  more  efficient  in  space,  since  it  does  not  access  the  whole  sample,  but  rather 
refers to it in succession, and needs only polynomial space in the number of states in
the  minimum  state  DFSA.  The  time  needed  for  our  algorithm  is  still  polynomial  in 
the  size  of  the  whole  sample,  though  logarithmic  in  an  amortized  sense,  as  we  show 
in  Section  3d. 
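
To see why a uniform-complete sample is exponential in the length of its longest string, one can simply count the strings of each length. The helper below is a small illustrative sketch; the name uniform_complete_size is not from the paper.

```python
def uniform_complete_size(alphabet_size, max_len):
    """Number of strings of length 0..max_len over the given alphabet."""
    # There are alphabet_size ** i strings of each length i.
    return sum(alphabet_size ** i for i in range(max_len + 1))
```

Over a two-letter alphabet this is 2^(k+1) - 1 for maximum length k, so strings up to length 10 already give 2047 examples, while the learner's guess stays polynomial in the number of states of the target DFSA.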

3.  Sequential  version  for  learning  FSA 
3a.  Notation  and  definitions 

We  use  the  following  notation  and  definitions: 

*  A finite-state automaton (FSA) M is a 5-tuple (Q, Σ, δ, q0, F) where

•  Q is a finite nonempty set of states.

•  Σ is a finite nonempty set of letters.

•  δ is a transition function that maps each pair (q, σ) to a set of states, where
q ∈ Q and σ ∈ Σ. This function can be represented by the set of links E so that
(p σ q) ∈ E iff q ∈ δ(p, σ). Each link is either mutable or permanent.

•  δ can be naturally extended to any string x ∈ Σ* in the following way: δ(q, λ) =
{q}, and for every string x ∈ Σ* and for every letter σ ∈ Σ, δ(q, xσ) = {p |
(∃r ∈ Q) (r ∈ δ(q, x) and p ∈ δ(r, σ))}.

•  q0 is the initial state, q0 ∈ Q.

•  F is the set of accepting states, F ⊆ Q. (Q - F is called the set of rejecting
states).

•  The parity f(q) of a state q ∈ Q is + if q ∈ F and is - if q ∈ Q - F. By
extension, assuming for some q ∈ Q and σ ∈ Σ that δ(q, σ) ≠ ∅, we define the
parity of this state-symbol pair f(q, σ) to be + if all successors of q under σ
are + and - if they are all -. If the states r ∈ δ(q, σ) do not all have the same
parity, then f(q, σ) is undefined.


*  A deterministic FSA (DFSA) is an FSA where the transition function δ is from
Q × Σ into Q.

*  The language L(M) accepted by a DFSA, M, is the set {x ∈ Σ* | δ(q0, x) ∈ F}.

*  Given a regular language L, we denote by M_L the (up to isomorphism)
minimum state DFSA s.t. L(M_L) = L. Q_L is the state set of M_L.
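
The extended transition function and the parity of a state-symbol pair can be illustrated on the intermediate guess described in the Introduction (state 0 accepting, state 1 rejecting, one link from 0 to 1 under a, and every other link still present). This is a sketch under assumptions: the dictionary encoding of the link set E and the names delta_star, parity and pair_parity are illustrative, not the paper's.

```python
def delta_star(links, q, x):
    """Extended transition function: all states reachable from q on x."""
    current = {q}
    for letter in x:
        current = {p for r in current for p in links.get((r, letter), set())}
    return current


def parity(accepting, states):
    """'+' or '-' if all states agree, None if undefined (or no states)."""
    signs = {"+" if q in accepting else "-" for q in states}
    return signs.pop() if len(signs) == 1 else None


def pair_parity(links, accepting, q, letter):
    """Parity f(q, letter) of a state-symbol pair."""
    return parity(accepting, links.get((q, letter), set()))


# Guess after the examples +λ and -a for the "even a's" language.
LINKS = {(0, "a"): {1}, (0, "b"): {0, 1}, (1, "a"): {0, 1}, (1, "b"): {0, 1}}
ACCEPTING = {0}
```

Here delta_star(LINKS, 0, "aa") is {0, 1}, pair_parity(LINKS, ACCEPTING, 0, "a") is "-", and pair_parity(LINKS, ACCEPTING, 0, "b") is undefined (None) because the two successors disagree.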

The late lower case letters v, w, x, y, z will range over strings. Given a current
FSA, the new string to be considered is denoted by w (the wrecker that may break
the machine). Lower case letters p, q, r, s, t range over names of states. Whenever
the current w wrecks the current guess, a new state, denoted by s (supplemental
state), is added. σ, φ, ψ will range over letters, and i, j, k, m, n over the natural
numbers.

*  M_x = (Q_x, Σ, δ_x, q0, F_x) is the FSA, referred to also as the guess, after the
finite prefix ±λ, ..., ±x of the complete lexicographically ordered sequence.
E_x is the corresponding set of links.

*  For x ∈ Σ*, succ(x) stands for the string following x in the lexicographic order.

*  The incremental construction of M_x admits for every state q, a unique string
minword(q) that leads from q0 to q using permanent links only. The path for
minword(q) is referred to as the basic path to state q. These basic paths, which
cover all the permanent links, form a spanning tree on the set Q_x.

*  The guessing procedure also establishes for any M_x and any string y, unless y
= minword(q) for some q, the existence of a unique state p, a letter φ and a
string z ∈ Σ*, so that y = minword(p)φz, and all the links from p under φ are
mutable links. We refer to these state, letter and string as the tested state,
tested letter and tested tail (respectively) for y in M_x. Figure 4 shows the tree of
all the paths for some string in some FSA, indicating the tested state p, tested
letter φ and tested tail z:

Figure 4

We use the convention of representing a permanent link by ⇒ and a mutable link by →.

*  For a given M_x and a word y, a path for y in M_x is right if it ends with an
accepting state and y ∈ L, or it ends with a rejecting state and y ∉ L.
Otherwise, this path is called wrong.

*  For two strings x, y ∈ Σ*, and a language L, x =_L y if both strings are in L, or
both are not in L.

3b.  The  learning  algorithm 

Let L be the regular language to be incrementally learned. Initially M_λ is
constructed according to the first example ±λ. Q_λ = {q0}; E_λ = {(q0 σ q0) | σ ∈ Σ} and
each link is mutable. If λ ∈ L, then q0 is an accepting state, otherwise it is a rejecting
one. minword(q0) is set to λ.

Given M_x, the value minword(q) for every q ∈ Q_x, and a new string ±w, w =
succ(x), the learning algorithm for constructing the new M_w is given in Figure 5.
The algorithm is annotated with some important assertions (invariants at some
control points) written between set brackets {...}.

The  subroutine  delete-bad-paths  (M,  y,  accept)  is  a  procedure  that  constructs  a 
new  FSA  out  of  the  given  M,  in  which  every  path  for  y  leads  to  an  accepting  state  iff 
accept  =  true.  In  the  case  y  =  w,  delete-bad-paths  breaks  all  wrong  paths  (if  any)  for 
w  in  M.  In  the  case  y  <  w,  the  paths  in  M  are  checked  against  the  behavior  of  the  old 
machine  old-M.  In  any  case,  each  bad  path  for  y  in  M  is  broken  by  deleting  its  first 
mutable  link.  Note  that  all  the  first  mutable  links  along  bad  paths  are  from  the  same 
tested  state  p  for  y  in  M.  Furthermore,  if  all  the  paths  for  y  in  M  are  bad  (and  we  will 
show that this can happen only if y = w), then after the execution of delete-bad-paths
there  will  be  no  link  from  p  under  the  tested  letter  for  y  in  M.  Such  an  execution  will 
be  followed  by  an  execution  of  insert-state. 

begin
   old-M ← M_x;
   if w ∈ L then accept-w ← true else accept-w ← false;
   new-M ← delete-bad-paths (old-M, w, accept-w);
   if there is no path for w in new-M
   then {all the paths for w in M_x are wrong}
        {old-M is consistent with all strings up through x}
      repeat
         new-M ← insert-state;
         {new-M has a new state and all feasible links to and from it}
         {new-M may be inconsistent with previous strings}
         y ← λ;
         while succ(y) < w
         begin
            y ← succ(y);
            if all the paths for y in old-M lead to accepting states
               then accept ← true
               else accept ← false;
            {there exists a right path for y in new-M}
            new-M ← delete-bad-paths (new-M, y, accept)
            {new-M is now correct with respect to the strings λ, ..., y}
         end;
         old-M ← new-M;
         {old-M is consistent with all strings up through x}
         new-M ← delete-bad-paths (new-M, w, accept-w)
      until there exists a path for w in new-M;
   output new-M {M_w will be the new FSA new-M}
end


Figure 5: The learning algorithm

The procedure insert-state constructs a new FSA by extending the given spanning
tree defined by the permanent links in old-M. A new state s is added. Let p and φ be
the tested state and tested letter for w in old-M. Note again that all the mutable
links from p under φ had been deleted in the last execution of delete-bad-paths. A
new permanent link (p φ s) is added. minword(s) is set to minword(p)φ. The parity
of s is set under the following rule: If minword(s) = w, then s is an accepting state iff
accept = true. In other words, in that case the parity of s is opposite to that of the
states at the ends of the paths for w in old-M. If minword(s) < w then s is an accepting
state iff all the paths for minword(s) in old-M end with accepting states. Next,
mutable links to and from the new state s are added according to the following rule:
For any existing state q, and for any letter σ, if minword(s)σ > minword(q), then add
the mutable link (s σ q). Also, in the other direction, if the current links from q under
σ are all mutable, and minword(q)σ > minword(s), add the mutable link (q σ s).
Note that this rule adds (for q = s) all possible self links for the new state s. In other
words, for every letter σ, the mutable link (s σ s) exists after insert-state.
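
The link-adding rule of insert-state can be sketched in a few lines. This is an illustrative fragment rather than the implementation mentioned in this paper: it assumes states are numbered 0, 1, ..., that links maps (state, letter) to a set of targets, that permanent holds the (state, letter) pairs with a permanent link, and it leaves out the parity assignment for the new state. The helper key encodes the strict lexicographic order (length first, then alphabetical).

```python
def key(x):
    """Sort key for the strict lexicographic order on strings."""
    return (len(x), x)


def insert_state(minword, links, permanent, p, phi, alphabet):
    """Add a new state s below tested state p under tested letter phi."""
    s = len(minword)
    minword.append(minword[p] + phi)
    links[(p, phi)] = {s}          # the new permanent link (p phi s)
    permanent.add((p, phi))
    for a in alphabet:
        # Mutable links (s a q), including the self links (s a s).
        links[(s, a)] = {q for q in range(s + 1)
                         if key(minword[s] + a) > key(minword[q])}
    for q in range(s):
        for a in alphabet:
            # Mutable links (q a s), only where q has no permanent link under a.
            if (q, a) not in permanent and key(minword[q] + a) > key(minword[s]):
                links[(q, a)].add(s)
    return s
```

Starting from a one-state guess with minword = [""], a self link under a, and no remaining link under b, calling insert_state(minword, links, permanent, 0, "b", "ab") creates state 1 with minword b, gives it links to both states under both letters, and does not add a link from state 0 to state 1 under a, because a precedes b in the order.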

Given $M_x$ and $w = \mathit{succ}(x)$, if all the paths for w in $M_x$ are wrong, then the repeat
loop takes place. This loop defines the extension process, which is a repetition of one
or more applications of the insert-state procedure. It is easy to see that there will be
at most $|w| - |\mathit{minword}(p)|$ insertions (applications of insert-state), where p is the
tested state for w in $M_x$. Suppose there are i insertions between $M_x$ and $M_w$, each
adding a new state. We can refer to a sequence of length i of machines: $M_0^w, M_1^w, \ldots, M_{i-1}^w$,
each of which is the old-M at the beginning of the repeat loop. $M_0^w$ is the old-M
as set to $M_x$ at the beginning; the others are iteratively set in the body of the
repeat loop. For every j, $0 \le j \le i-1$, the execution of insert-state defines a new
machine out of $M_j^w$, referred to as $M_j^w(\lambda)$. Thereafter, for every j, $0 \le j \le i-1$, and for
every y, $\mathit{succ}(\lambda) \le y < w$, the execution of delete-bad-paths within the while loop
defines a new machine (possibly the same as the preceding one), referred to as
$M_j^w(y)$, indicating that this machine is ensured to be consistent with those strings up
through y.
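
The successor function that drives both the main loop and the retesting loop can be sketched as below. The name `succ` comes from the text; the implementation is an assumption of this sketch, taking the sample order to be by length first and alphabetically within a length.

```python
def succ(x, alphabet="ab"):
    """Return the string that follows x in the ordered sample."""
    letters = sorted(alphabet)
    first, last = letters[0], letters[-1]
    chars = list(x)
    i = len(chars) - 1
    # Trailing occurrences of the last letter wrap around to the first letter.
    while i >= 0 and chars[i] == last:
        chars[i] = first
        i -= 1
    if i < 0:
        # Every position wrapped: move on to the next length.
        return first * (len(x) + 1)
    chars[i] = letters[letters.index(chars[i]) + 1]
    return "".join(chars)


# Enumerate the beginning of the sample, starting from the empty string.
y, prefix = "", []
for _ in range(8):
    prefix.append(y)
    y = succ(y)
print(prefix)
```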

The  algorithm  was  successfully  implemented  as  a  student  course  project  by  Lori 
Cohn,  using  C. 

Before going on to the correctness proof of this algorithm, we will discuss an
example over the alphabet $\Sigma = \{a, b\}$. Suppose that the unknown language L is
"number of a's is at most 1, and number of b's is at least 1", or $bb^*(\lambda + ab^*) + abb^*$ as
given by a regular expression. Figure 6 below shows some of the guesses.
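
As a sanity check on the example language, the following sketch compares the counting description with the regular expression on all short strings. The Python pattern `b+(ab*)?|ab+` is a transcription made here of the expression above, and the helper names are hypothetical.

```python
import re
from itertools import product

# Transcription of bb*(lambda + ab*) + abb* into Python regex syntax.
PATTERN = re.compile(r"b+(ab*)?|ab+")


def in_language(s):
    """Counting description: at most one a and at least one b."""
    return s.count("a") <= 1 and s.count("b") >= 1


def matches_regex(s):
    return PATTERN.fullmatch(s) is not None


def strings_up_to(max_len, alphabet="ab"):
    """All strings over the alphabet of length 0..max_len, in sample order."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


disagreements = [s for s in strings_up_to(6) if in_language(s) != matches_regex(s)]
print(disagreements)
```

The same predicate reproduces the labels used in the walkthrough, for example -aa, +ab, +ba and -aab.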

Initially, since the first input example is $-\lambda$, $q_0$ is a rejecting state, having both (a,
b) self-loop mutable links. For the next example -a, delete-bad-paths does not
change the machine, hence $M_a$ is the same as $M_\lambda$. When +b is encountered, the
mutable link $(q_0\ b\ q_0)$ is deleted, and the repeat loop takes place ($M_0^b = M_a$). A new
state $s = q_1$ is added, and a new permanent link $(q_0\ b\ q_1)$ is added. $\mathit{minword}(q_1)$ is
set to b. Since $\mathit{minword}(q_1)$ is the current example, $q_1$ gets the opposite parity from
that of $q_0$. Hence, $q_1$ is an accepting state. The new mutable links are $(q_1\ a\ q_1)$, $(q_1\ b\ q_1)$,
$(q_1\ a\ q_0)$ and $(q_1\ b\ q_0)$. Note that $(q_0\ a\ q_1)$ is not added, since $a = \mathit{minword}(q_0)a$ is
less than $b = \mathit{minword}(q_1)$. The new machine is $M_0^b(\lambda)$. Since all the paths (there is
only one) for a in $M_0^b(\lambda)$ are right with respect to the old machine $M_0^b$, we get that
$M_0^b(\lambda) = M_0^b(a)$. The only path for b is right, hence $M_b$ is $M_0^b(a)$. The examples -aa
and +ab do not change the current guess. When +ba is encountered, there exists a
right path $\langle q_0\ b\ q_1,\ q_1\ a\ q_1 \rangle$ and there exists a wrong path $\langle q_0\ b\ q_1,\ q_1\ a\ q_0 \rangle$. The
first (and only) mutable link $(q_1\ a\ q_0)$ along the wrong path is deleted. A similar
treatment is involved for the example +bb. Note that at this stage $M_{bb}$ is a DFSA,
but obviously $L(M_{bb}) \ne L$.

The next string -aaa does not change the current guess, but -aab causes a new
application of insert-state. A new state $q_2$ is added, with $\mathit{minword}(q_2)$ being a. The
string aa is the first string that changes the machine while testing $M_0^{aab}(b)$ against
the old machine $M_0^{aab}$. The new mutable link $(q_2\ a\ q_1)$ is deleted. Other new
mutable links are deleted while retesting ab, ba and bb. The execution of
delete-bad-paths on $M_0^{aab}(aaa)$ ($= M_0^{aab}(bb)$) deletes the two mutable links $(q_2\ a\ q_2)$ and
$(q_2\ a\ q_0)$, hence causing the nonexistence of a path for aab in the new machine. Thus, a
new insertion is applied, replacing the two mutable links $(q_2\ a\ q_2)$ and $(q_2\ a\ q_0)$ by a
new permanent link from $q_2$ to a new state $q_3$ under a. The retesting of the strings a,
b, and aa against $M_1^{aab}$ ($= M_0^{aab}(aaa)$) causes no change (no deletion) on $M_1^{aab}(\lambda)$.
Some of the new mutable links are deleted while retesting ab, ba, bb and aaa. In
$M_1^{aab}(aaa)$ there exist three right paths for aab, and one wrong path, causing the
deletion of $(q_3\ b\ q_1)$, yielding $M_{aab}$. Given this guess, $M_{aab}$, the only path for aba is
wrong. Note that this path has two mutable links and the first one is now replaced
by a permanent link to the new accepting state $q_4$. The retesting deletes some of the
new mutable links just recently added within the insertion of $q_4$.


[Figure 6: state diagrams of some of the guesses, starting with $M_0^b(\lambda) = M_0^b(a) = M_b = M_{aa} = M_{ab}$]

Figure 6
When -baa is checked, given $M_{abb}$, there are three right paths and two wrong
paths. The tested state is $q_1$, the tested letter is a, and the tested tail is a. The first
mutable link along the wrong paths is $(q_1\ a\ q_1)$. Hence, this link is deleted leaving
$(q_1\ a\ q_4)$ as the only mutable link from $q_1$ under a. This link is the first mutable link
along the three right paths. When -aaab is checked, given $M_{aaaa}$, there are again
three right paths and two wrong paths. This time, two mutable links $(q_3\ a\ q_0)$ and
$(q_3\ a\ q_2)$ are deleted, each breaking a different wrong path. Note that this deletion
leaves only one right path for aaab, as the two deleted links served as second
mutable links along two of the original right paths for aaab in $M_{aaaa}$. The
correctness proof will show that after the deletion process, there exists at least one
right path for the current sample. Finally, $M_{abba}$ accepts the language L.

3c.  The  correctness  proof 

Next we prove that the guessing procedure is correct. The first observation is to
show that each $M_x$ satisfies a set of invariants. Consider the following constraints
for an FSA $M = (Q, \Sigma, \delta, q_0, F)$ and a given string $x \in \Sigma^*$:

(1) Consistency:

$\forall y \le x$, all the paths for y are right.

(2) Completeness:

$(\forall q \in Q)(\forall \sigma \in \Sigma)$

$((\exists r \in Q)(\delta(q,\sigma) = \{r\}$ and $(q\ \sigma\ r)$ is a permanent link)

or

$(\exists Q' \subseteq Q)(\delta(q,\sigma) = Q'$ and $Q' \ne \emptyset$ and $(\forall r \in Q')((q\ \sigma\ r)$ is a mutable link)))

(3) Separability:

$(\forall q, r \in Q \mid q \ne r)(\exists y \in \Sigma^*)(\mathit{minword}(q)y \ne_L \mathit{minword}(r)y$ and $\mathit{minword}(q)y \le x$ and $\mathit{minword}(r)y \le x)$

(4) Minimality:

$(\forall q \in Q)(\forall y \in \Sigma^* \mid q \in \delta(q_0, y))(y \ge \mathit{minword}(q))$

(5) Minword-property:

$(\forall q \in Q)(x \ge \mathit{minword}(q)$ and there is a unique string, namely minword(q),
that has a path from $q_0$ to q using permanent links only)

Note that some properties refer to a designated sample string x. We say that M
y-satisfies a set of properties if whenever some property in this set relates to x, and y is
substituted, then M satisfies the corresponding conjunct of properties.

The  consistency  constraint  is  the  obvious  invariant  we  would  expect  the  learning 
algorithm  to  maintain.  The  completeness  constraint  means  that  for  every  state  and 
for  every  letter  there  is  either  a  permanent  link  that  exits  this  state  under  this 
letter,  or  else  there  is  a  nonempty  set  of  links  leaving  the  state  under  this  letter,  all 
of which are mutable. The separability constraint together with the Myhill-Nerode
theorem [HU 79] will lead to the claim that for each sample x, the number of states
in $M_x$ is at most the number of states in the minimum-state DFSA for the inferred
language L. This can be established by continually preserving the minimality
constraint.  The  minword-property  together  with  the  completeness  constraint 
implies  the  existence  of  the  spanning  tree  formed  by  the  permanent  links. 
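
The completeness constraint translates directly into a check on a link table. The representation below is an assumption made for illustration only: `links` maps each (state, letter) pair to a dictionary from target states to the tags "perm" or "mut".

```python
def is_complete(states, alphabet, links):
    """Check the completeness constraint on a machine with tagged links."""
    for q in states:
        for c in alphabet:
            kinds = list(links.get((q, c), {}).values())
            if not kinds:
                return False  # no link at all leaves q under c
            if "perm" in kinds and len(kinds) != 1:
                return False  # a permanent link must be the only link
    return True


# The guess right after inserting q1 in the worked example.
links = {
    ("q0", "a"): {"q0": "mut"},
    ("q0", "b"): {"q1": "perm"},
    ("q1", "a"): {"q1": "mut", "q0": "mut"},
    ("q1", "b"): {"q1": "mut", "q0": "mut"},
}
print(is_complete(["q0", "q1"], "ab", links))
```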

Following  are  simple  facts  that  are  implied  by  the  above  properties.  We  will  refer 
to  these  facts  frequently  in  the  sequel. 

Fact 1: $\forall q \in Q$, $\forall y \in \Sigma^*$, there is a path in M for y from the state q. (Implied by the
completeness constraint.)

Fact 2: $\forall y \in \Sigma^*$, if $\forall q \in Q$, $y \ne \mathit{minword}(q)$, then there exists a unique tested state,
tested letter and tested tail for y in M. (Implied by the completeness constraint.)

Fact 3: $\forall y \in \Sigma^*$, if $y \le x$, then all the paths for y from $q_0$ lead either to accepting
states, or they all lead to rejecting states. (Implied by the consistency constraint.)

Fact 4: $\forall q \in Q$, minword(q) has a unique right path from $q_0$ to q through permanent
links only. (Implied by the consistency, completeness and minword-property
constraints.)

Fact 5: $\forall q \in Q$, $\forall y \in \Sigma^*$, $\forall z \in \Sigma^*$, if there exists a path for y from $q_0$ to q that uses at
least one mutable link, then $yz > \mathit{minword}(q)z$. (Implied by Fact 4 and the
minimality constraint.)
The correctness of the algorithm of Figure 5 will be established by showing that
after each completion of the algorithm, yielding a new guess $M_w$ by using the current
sample $\pm w$, the five constraints are w-satisfied.

Clearly, $M_\lambda$ $\lambda$-satisfies the constraints. Suppose (inductively) that $M_x$ x-satisfies
these constraints, and let $w = \mathit{succ}(x)$.

If all the paths for w in $M_x$ are right, then $M_w = M_x$, and if $M_x$ x-satisfies the
invariants, then $M_w$ w-satisfies them.

By the minword-property constraint, $w > \mathit{minword}(q)$ for each q. By Fact 2, let
p, $\phi$ and z be the tested state, tested letter and tested tail (respectively) for w in $M_x$.
Consider any of the states r, so that $(p\ \phi\ r) \in E_x$. By the definition of the tested
elements, $(p\ \phi\ r)$ is a mutable link. By Fact 5, $\mathit{minword}(r)z < w$. Therefore,
minword(r)z had been already checked. Hence, by Fact 3, all the paths for z from r
behave the same, i.e. $\delta_x(r,z) \subseteq F_x$ or $\delta_x(r,z) \subseteq Q_x - F_x$. Thus, all the paths for w that
use the mutable link $(p\ \phi\ r)$ are either all wrong paths, or all of them are right paths.

If there exist a wrong path and a right path for w in $M_x$, then by breaking each
possible wrong path for w by delete-bad-paths, the consistency constraint is
w-satisfied in $M_w$. The crucial point for establishing the completeness constraint in
this case is that the deleted mutable links are ensured to have a rival (another
mutable link) that will definitely exist in $M_w$. Hence, for every $y \le w$, there exists a
right path for y in the new $M_w$. The other three constraints, separability,
minimality and minword-property, are obviously w-satisfied.

If all the paths for w in $M_x$ are wrong, the expansion process takes place. Suppose
there are i insertions in between $M_x$ and $M_w$. We will show that the intermediate
FSA's $M_0^w, M_1^w, \ldots, M_{i-1}^w$ all x-satisfy the consistency, the completeness, the
minimality and the minword-property constraints. Moreover, all the paths for w in
$M_j^w$ ($0 \le j \le i-1$) are wrong, causing the re-application of insert-state.

$M_0^w$, being the same as $M_x$, obviously x-satisfies the consistency, the
completeness, the minimality and the minword-property constraints, and all the
paths for w in $M_0^w$ are wrong.
Suppose for the moment that $M_j^w$, $0 \le j \le i-1$, x-satisfies these four constraints.
By the minword-property constraint, $w > \mathit{minword}(q)$ for each q. By Fact 2, let p, $\phi$
and z be the tested state, tested letter and tested tail (respectively) for w in $M_j^w$. Let
s be the new state in $M_j^w(\lambda)$. In constructing $M_j^w(\lambda)$, the whole set of mutable links
from p under $\phi$ in $M_j^w$ is deleted, but it is replaced by the new permanent link
$(p\ \phi\ s)$. This, plus the fact that we add all possible self-looping links for the new state
s, establishes the part of the completeness constraint that ensures a nonempty set of
links for each state and each letter. The other part, indicating that this set is either
a singleton of a permanent link or a set of mutable links, is easily implied by the
construction. By the definition of insert-state, and the fact that permanent links are
never deleted in delete-bad-paths, $M_j^w(\lambda)$ obviously w-satisfies the
minword-property. As for the minimality constraint, suppose by way of contradiction that
there exist a state q and a string y, so that there is a path for y that leads from $q_0$ to q,
and $y < \mathit{minword}(q)$. By the minword-property, this path uses at least one mutable
link. By the minimality constraint of $M_j^w$, at least one of those mutable links is a
new one just added while constructing $M_j^w(\lambda)$. Consider one of them, $(r\ \sigma\ t)$. (Note
that it is either the case that r = s or t = s.) From the way new links are added we
immediately get a contradiction to the minimality assumption of y. Hence we
conclude that $M_j^w(\lambda)$ w-satisfies the completeness, the minword-property and
the minimality constraints.

The retesting process (within the while loop) checks the current machine $M_j^w(y)$
against the old machine $M_j^w$, that is assumed to be consistent up through x. The
whole retesting process involves defining a sequence of FSA's: $M_j^w(\lambda)$, $M_j^w(\mathit{succ}(\lambda))$,
$M_j^w(\mathit{succ}(\mathit{succ}(\lambda)))$, ..., $M_j^w(x)$. We will show that each $M_j^w(y)$ is consistent up through
y, and that it w-satisfies the completeness, the minword-property and the
minimality constraints. For this we need to refer to another inductive hypothesis,
that will indicate, for each $y \ge \lambda$, the similarity between $M_j^w(y)$ and $M_j^w(\lambda)$. Let E'
be the set of mutable links that were added while constructing $M_j^w(\lambda)$ from $M_j^w$. We
will claim that each execution of delete-bad-paths along the construction of $M_j^w(y)$
out of $M_j^w(\lambda)$ deletes (if at all) only links from E'. Moreover, in the next paragraph
we define a subset of E' that will definitely remain in $M_j^w(x)$. A link in this subset
will be called a necessary link. Intuitively, these links will establish the existence of
right paths on $M_j^w(\lambda)$ that reconstruct paths in $M_j^w$ that use one of the mutable links
in $M_j^w$ from p under $\phi$.
A link $(s\ \sigma\ t)$ in $M_j^w(\lambda)$ is necessary iff $t \ne s$, $w > \mathit{minword}(s)\sigma$, and there is a path
for $\mathit{minword}(s)\sigma$ from $q_0$ to t in $M_j^w$. Figure 7 below shows how $M_j^w$ and $M_j^w(\lambda)$ relate
to each other with respect to p, s and t. The mutable link $(s\ \sigma\ t)$ will definitely be
added while constructing $M_j^w(\lambda)$, because by Fact 5 applied on $M_j^w$, $\mathit{minword}(s)\sigma
= \mathit{minword}(p)\phi\sigma > \mathit{minword}(t)$.


Figure  7 


In order to prove that $M_{j+1}^w$ x-satisfies the consistency, the completeness, the
minword-property and the minimality constraints (given that $M_j^w$ satisfies these
conditions), we refer to another property, namely the similarity constraint: For each
j, $0 \le j \le i-1$, and each y, $\lambda \le y \le x$, $M_j^w(y)$ is the same as $M_j^w(\lambda)$ except for the
removal of some of the new mutable links that were added while constructing $M_j^w(\lambda)$
out of $M_j^w$. All the necessary links still exist in $M_j^w(y)$.

For every j, $0 \le j \le i-1$, and for every y, $\lambda \le y \le x$, we will prove the following
intermediate invariant: $M_j^w(y)$ satisfies the completeness, the minimality and the
similarity constraints, it y-satisfies the consistency constraint, and w-satisfies the
minword-property constraint.

We have already shown that $M_j^w(\lambda)$ preserves the completeness and the
minimality constraints, and that it w-satisfies the minword-property constraint.
Obviously, it satisfies the consistency up through $\lambda$, and the similarity (to itself).
Thus, assuming inductively that $M_j^w(y)$, $\lambda \le y < x$, satisfies the intermediate
invariant, we need to show that so does $M_j^w(\mathit{succ}(y))$.
If the current execution of delete-bad-paths causes no deletions (all the paths for
succ(y) are right with respect to old-M), $M_j^w(\mathit{succ}(y))$ trivially maintains the
intermediate invariant.

Otherwise, we show that it cannot be the case that all the paths for succ(y) in
$M_j^w(y)$ are wrong with respect to old-M ($= M_j^w$). Moreover, if there exists a wrong
path for succ(y), it will be broken by deleting one of the non-necessary new mutable
links.

Assuming succ(y) has a wrong path in $M_j^w(y)$, we get by Fact 4 that $\mathit{succ}(y) \ne
\mathit{minword}(q)$ for each state q in $M_j^w(y)$. By Fact 2, let r, $\psi$ and v be the tested state,
the tested letter and the tested tail for succ(y) on $M_j^w(y)$. (As before, r is the last state
reached by permanent links.) We distinguish between two possible cases:

1) r = s, i.e. $\mathit{succ}(y) = \mathit{minword}(s)\psi v$.

Therefore, in $M_j^w$, the tested state and the tested letter for succ(y) were p and
$\phi$. Figure 8 shows the relation between $M_j^w$ and $M_j^w(y)$ with respect to p and s.


a  necessary  link 

Figure  8 

Let q, t be states so that $(p\ \phi\ q)$ and $(q\ \psi\ t)$ are links in $M_j^w$. By the similarity
constraint of $M_j^w(y)$, there must be some paths for succ(y) on $M_j^w(y)$ that use the
existing necessary link $(s\ \psi\ t)$. By Fact 5 applied to $M_j^w(y)$, $\mathit{minword}(t)v <
\mathit{minword}(s)\psi v = \mathit{succ}(y)$. Finally, by the consistency constraint of $M_j^w(y)$, all the
paths for minword(t)v in $M_j^w(y)$ are right. Clearly, by Fact 3 applied to $M_j^w$,
$\mathit{minword}(t)v =_L \mathit{succ}(y)$, which implies in turn that all these paths for succ(y) in
$M_j^w(y)$ that use the existing necessary link $(s\ \psi\ t)$ must be right. Hence, this
necessary link $(s\ \psi\ t)$ will not be deleted. Other non-necessary mutable links
from s that establish wrong paths for succ(y) (and there exists such a wrong one)
will be deleted by the current execution of delete-bad-paths.

2) $r \ne s$. Hence, r, $\psi$ and v serve as the tested state, tested letter and tested tail
for succ(y) on $M_j^w$ also, and $\mathit{succ}(y) = \mathit{minword}(r)\psi v$. Figure 9 below indicates
the relation between $M_j^w$ and $M_j^w(y)$ in this case. Let t be a state in $M_j^w$ so that
some paths for succ(y) in $M_j^w$ use the mutable link $(r\ \psi\ t)$ (right after the
permanent prefix). By the similarity constraint of $M_j^w(y)$, some paths for succ(y)
on $M_j^w(y)$ use this existing mutable link. Again, by the consistency constraint,
Fact 3 and Fact 5 applied to $M_j^w(y)$, all the paths in $M_j^w(y)$ for minword(t)v
behave the same and are right. By Fact 3 applied to $M_j^w$, $\mathit{minword}(t)v =_L
\mathit{succ}(y)$. Hence, all the paths for succ(y) in $M_j^w(y)$ that use $(r\ \psi\ t)$ are right,
implying in turn that this (old) mutable link will not be deleted. By the
assumption, there exists a wrong path for succ(y) in $M_j^w(y)$. Hence, there exists a
new mutable link in $M_j^w(y)$, $(r\ \psi\ s)$, that will be now deleted in order to break a
bad path. This new mutable link is clearly a non-necessary one.

This terminates the discussion on the relation between $M_j^w(y)$ and $M_j^w(\mathit{succ}(y))$,
and based on this we can easily conclude that the intermediate invariant is satisfied
by $M_j^w(\mathit{succ}(y))$. Consequently, $M_j^w(x)$ satisfies this intermediate invariant, and in
particular it is consistent up through x. For $0 \le j < i-1$, $M_{j+1}^w = M_j^w(x)$. We get
the desired hypothesis that this $M_{j+1}^w$ x-satisfies the consistency, the completeness
and the minimality constraints, and that all the paths for w in this new machine are
wrong. Since $M_j^w(x)$ w-satisfies the minword-property constraint, and there exists a
wrong path for w in $M_j^w(x)$, we get by Fact 4 applied to $M_j^w(x)$ ($= M_{j+1}^w$) that $M_{j+1}^w$
x-satisfies the minword-property constraint. For j = i-1, we break all the wrong
paths for w on $M_{i-1}^w(x)$ by deleting the first mutable links along them. By similar
arguments as above, we get that the new $M_w$ w-satisfies the consistency, the
completeness, the minword-property and the minimality constraints.

The last thing to be shown is that $M_w$ (obtained after the extension process)
w-satisfies the separability constraint.

Suppose that in executing the repeat loop we have inserted i new states. For $0 \le j
\le i-1$, let $s_{j+1}$ be the new state added while constructing $M_j^w(\lambda)$ from $M_j^w$. Obviously
$Q_w = Q_x \cup \{s_1, \ldots, s_i\}$.

We say that two states q, r are w-separable if $\exists y \in \Sigma^*$ such that $\mathit{minword}(q)y \ne_L
\mathit{minword}(r)y$, where $\mathit{minword}(q)y \le w$ and $\mathit{minword}(r)y \le w$.

We prove by induction on j, $0 \le j \le i$, that each pair of states in $Q_x \cup \{s_k \mid k \le j\}$ is
w-separable. The basic assumption, for j = 0, is directly implied from the fact that
$M_x$ w-satisfies the separability constraint.

Assuming that each pair of states in $Q_x \cup \{s_k \mid k \le j$ and $j < i\}$ is w-separable, we
have to show that each state in this set is w-separable from $s_{j+1}$. Formally, let Q be
the state set of $M_j^w$; we need to show that $(\forall q \in Q)(\exists y \in \Sigma^*)(\mathit{minword}(q)y \ne_L
\mathit{minword}(s_{j+1})y$ and $\mathit{minword}(q)y \le w$ and $\mathit{minword}(s_{j+1})y \le w)$.

Let p, $\phi$ and z be the tested state, tested letter and tested tail for w in $M_j^w$, and $w
= \mathit{minword}(s_{j+1})z$. Let q be an arbitrary state of $M_j^w$. We distinguish between the
case where q was connected to p in $M_j^w$ through the mutable link $(p\ \phi\ q)$, versus the
case where they were not connected like this.

1) $M_j^w$ has the mutable link $(p\ \phi\ q)$. By the corresponding execution of
delete-bad-paths, this link will be deleted and replaced in $M_j^w(\lambda)$ by the new permanent
link $(p\ \phi\ s_{j+1})$. $(p\ \phi\ q)$ was deleted since all the paths for $w = \mathit{minword}(s_{j+1})z$ on
$M_j^w$ were wrong. By Fact 5 applied on $M_j^w$, $\mathit{minword}(q)z < \mathit{minword}(p)\phi z = w$.
By the consistency constraint of $M_j^w$, all the paths for minword(q)z in $M_j^w$ are
right. Hence, $\mathit{minword}(s_{j+1})z \ne_L \mathit{minword}(q)z$, $\mathit{minword}(s_{j+1})z = w$ and
$\mathit{minword}(q)z < w$.

2) The link $(p\ \phi\ q)$ does not exist in $M_j^w$. There can be two different sub-cases
under this condition. The first one is that $(p\ \phi\ q)$ had never been added while
constructing one of the previous FSA's. The second sub-case is that $(p\ \phi\ q)$ has
once been deleted. Note that in each of the previous FSA's, the links that leave p
under $\phi$ are always mutable links (and $M_j^w(\lambda)$ will be the first FSA having a
permanent link from p under $\phi$). Hence, if $(p\ \phi\ q)$ has been once added, and
thereafter deleted, the deletion was due to an execution of delete-bad-paths.

2.1) $(p\ \phi\ q)$ had never been added. Since it is not the case that there exists a
permanent link from p under $\phi$ (establishing a possible reason for not adding
$(p\ \phi\ q)$), it must be that $\mathit{minword}(p)\phi < \mathit{minword}(q)$.

Now clearly there exists a mutable link in $M_j^w$ from p under $\phi$. Let t be a state
such that $(p\ \phi\ t)$ exists in $M_j^w$. By the induction hypothesis, q and t (being two
distinct states in $M_j^w$) are w-separable, hence $\exists y \in \Sigma^*$, so that $\mathit{minword}(q)y \le w$,
$\mathit{minword}(t)y \le w$ and $\mathit{minword}(q)y \ne_L \mathit{minword}(t)y$. Figure 10 shows the relation
between $M_j^w$ and $M_j^w(\lambda)$ with respect to p, q and t.

Figure 10

By Fact 5 applied to $M_j^w$, $\mathit{minword}(t) < \mathit{minword}(p)\phi$. By the initial
assumption for this subcase, $\mathit{minword}(p)\phi < \mathit{minword}(q)$. Thus, since
$\mathit{minword}(q)y \le w$ and $\mathit{minword}(t)y \le w$, we get that both minword(t)y and
$\mathit{minword}(p)\phi y$ are less than w. By the consistency constraint of $M_j^w$, all the paths
for $\mathit{minword}(p)\phi y$, and all the paths for minword(t)y are right, and clearly
$\mathit{minword}(p)\phi y =_L \mathit{minword}(t)y$. Since y is separating between t and q, and
$\mathit{minword}(p)\phi = \mathit{minword}(s_{j+1})$, we can conclude that y is a sufficiently small tail
separating between $s_{j+1}$ and q.

2.2) The next subcase deals with the mutable link $(p\ \phi\ q)$ being deleted due to
some string v, $v \le x$. $(p\ \phi\ q)$ has been the first mutable link along a wrong path
for v on some previous FSA. Thus, there exists $y \in \Sigma^*$, so that $v = \mathit{minword}(p)\phi y$.
The automaton in which the decision to delete $(p\ \phi\ q)$ has been taken was obviously
consistent with respect to minword(q)y, as $\mathit{minword}(q)y < v$ by the minimality
constraint which is continually satisfied. This establishes the claim that
$\mathit{minword}(p)\phi y \ne_L \mathit{minword}(q)y$. As $\mathit{minword}(q)y < \mathit{minword}(p)\phi y =
\mathit{minword}(s_{j+1})y \le x$, we get a perfect separating string for q and $s_{j+1}$.

This  finishes  the  proof  of  the  claim  on  the  separability  constraint. 

Let $R_L$ be the equivalence relation (Myhill-Nerode relation) associated with L, so
that for $x, y \in \Sigma^*$, $x\ R_L\ y$ iff $(\forall z \in \Sigma^*)(xz =_L yz)$. By the Myhill-Nerode theorem [HU 79],
$|Q_L|$ = number of equivalence classes of $R_L$.
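
For the example language of Section 3b, the number of Myhill-Nerode classes can be estimated by comparing prefixes on a bounded set of suffixes. This is only a finite approximation written for illustration (the helper names and the length bounds are assumptions): in general it yields a lower bound on the number of classes, which here already coincides with the five states $q_0, \ldots, q_4$ of the walkthrough.

```python
from itertools import product


def in_language(s):
    """Example language: at most one a and at least one b."""
    return s.count("a") <= 1 and s.count("b") >= 1


def strings_up_to(max_len, alphabet="ab"):
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def count_classes(prefix_len, suffix_len):
    """Count prefixes that are told apart by some suffix of bounded length."""
    suffixes = list(strings_up_to(suffix_len))
    signatures = {
        tuple(in_language(u + z) for z in suffixes)
        for u in strings_up_to(prefix_len)
    }
    return len(signatures)


print(count_classes(prefix_len=3, suffix_len=2))
```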

By the separability constraint, $\forall x \in \Sigma^*$, $|Q_x| \le |Q_L|$. Thus, there exists $x^* \in \Sigma^*$,
such that $\forall y \ge x^*$, $|Q_y| = |Q_{x^*}|$ (after reaching $M_{x^*}$ the extension process would never
be applied again). Consider such an FSA, $M_y$ where $y \ge x^*$. Suppose there is a state
$q \in Q_y$ for which there are at least two distinct mutable links $(q\ \sigma\ r)$ and $(q\ \sigma\ t)$. By
the separability constraint, $\exists z \in \Sigma^*$ such that $\mathit{minword}(r)z \ne_L \mathit{minword}(t)z$. If both
links still exist while considering the string $\mathit{minword}(q)\sigma z$ (by the consistency
constraint, $\mathit{minword}(q)\sigma z > y$) then, at this stage, one of them will definitely be
deleted. Hence, eventually we will get a DFSA. Moreover, if $M_x$ is a DFSA where
$|Q_x| < |Q_L|$, then $\exists y > x$ so that the path for y in $M_x$ is wrong. Therefore, we finally
conclude that eventually we will get a minimum-state DFSA isomorphic to $M_L$. This
completes the correctness proof.

Having  proven  the  correctness  of  the  learning  algorithm,  we  now  consider  some 
properties  of  this  process,  mainly  with  respect  to  time  and  space  complexity. 
3d. Complexity analysis

Given a current guess $M_x$, the value minword(q) for every $q \in Q_x$ and the
successor sample string $\pm w$, we analyze first the time complexity. Let $|Q_x| = n$, $|w|
= m$, and $|\Sigma| = \alpha$. Obviously, the size of the whole sample considered so far, $\lambda$,
$\mathit{succ}(\lambda)$, $\mathit{succ}(\mathit{succ}(\lambda))$, ..., w, is exponential in m (greater than $\alpha^{m-1}$).
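
The exponential bound can be checked with a short geometric sum. Writing $\alpha$ for the alphabet size $|\Sigma|$ and assuming $\alpha \ge 2$, the sample up to w contains at least every string shorter than w, so counting strings by length gives:

```latex
\sum_{k=0}^{m-1} \alpha^{k} \;=\; \frac{\alpha^{m}-1}{\alpha-1} \;\ge\; \alpha^{m-1},
\qquad \text{e.g. } \alpha = 2,\ m = 3:\quad 1 + 2 + 4 = 7 \ge 4 .
```

Adding w itself to the count makes the inequality strict.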

First we note that executing delete-bad-paths on $M_x$ and w requires only
polynomial time with respect to the size of $M_x$ and w. Notice that the algorithm does
not need to check every path of w in $M_x$, but rather to consider one path for each first
mutable link along paths for w in $M_x$. This is due to the observation we have already
made that if $(p\ \phi\ q)$ is some first mutable link for w in $M_x$, then all the continuing
paths from q behave the same. Each path for w can be checked in polynomial time,
and there can be at most n different paths to be checked. Moreover, note that within
this process, which must be applied for each string, the tested state and tested letter
can be recorded, and this might later be used.

If insert-state is activated, then it can be done in polynomial time, gaining some
efficiency by using the recorded tested state and tested letter. The dominating step
is the addition of all possible new mutable links, which obviously involves considering
each existing state q (and its minword), and each letter $\sigma$. Insert-state can be
repeatedly activated, for at most m times, and it is generally much less.

The retesting while loop is repeated for each string y, $\mathit{succ}(\lambda) \le y \le x$. For each
such  y,  the  condition  that  determines  the  value  of  the  boolean  variable  accept  can 
again  be  checked  in  polynomial  time.  Note  that,  due  to  the  consistency  constraint, 
only  one  path  for  y  in  the  current  old-M  has  to  be  checked  to  determine  the  behavior 
of  the  old  machine  with  respect  to  y.  The  total  retesting  process  (reaching  the  case 
for  which  succ(y)  =  w)  can  take  exponential  time  in  the  size  of  the  current  example 
w,  since  all  previous  strings  are  checked,  but  it  is  still  polynomial  in  the  size  of  the 
whole  sample  considered  so  far.  Such  retesting  happens  very  infrequently,  in  fact  it 
is  invoked  once  per  state  insertion.  Therefore,  the  amortized  cost  of  the  retesting 
process  is  polynomial  in  the  size  of  the  current  input.  Finally,  we  conclude  that  the 
whole  amount  of  time  needed  for  the  learning  algorithm  is  polynomial  in  n  in  an 
amortized  sense. 
Furthermore, the extension process can be somewhat improved. First we change,
in insert-state, the rule on adding a new mutable link from an old state q to the new
one s under some letter $\sigma$. The new rule states that we add $(q\ \sigma\ s)$ when two
conditions are met. The first one is the original one, namely that the current links
from q under $\sigma$ are all mutable, and $\mathit{minword}(q)\sigma > \mathit{minword}(s)$. The second one is
that either $\mathit{minword}(q)\sigma > w$, or $\mathit{minword}(q)\sigma < w$ (so that the current parity of the
pair $(q, \sigma)$ is defined) and s is of the correct parity, i.e. $f(s) = f(q,\sigma)$ in the current $M_j^w$.
This rule is obviously correct, since if all the current links from q under $\sigma$ are
mutable, and $\mathit{minword}(s) < \mathit{minword}(q)\sigma$ (so that $(q\ \sigma\ s)$ would have been added
under the previous rule), and moreover $\mathit{minword}(q)\sigma < w$ and $f(s) \ne f(q,\sigma)$ (so that
$(q\ \sigma\ s)$ would not have been added under the new rule), then the retesting process will
definitely prune this new mutable link while considering $y = \mathit{minword}(q)\sigma$. Hence,
omitting those links immediately in insert-state might achieve some efficiency in the
inspection of all possible paths for some y within the while loop.

A more significant improvement is due to the fact that within the retest process
(the while loop) only new mutable links (to and from the new state) might be deleted
as being first mutable links along wrong paths for some y. Consequently, we need
only check the following subset of the sample. For each state q and symbol σ, such
that a new mutable link from q under σ has just been added within the last execution
of insert-state, test the strings minword(q)σz that are smaller than w. Note that the
new rule for adding new mutable links now has a more considerable impact on the
performance, by omitting a whole set of strings from being checked. There are still
n × a^|z| such strings, but |z| will usually be small and there might even be states
for which no string will need to be tested.
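
To make the reduced retest set concrete, the sketch below enumerates the strings minword(q)σz that are smaller than w. It is illustrative only: `retest_strings` and its argument layout are hypothetical, and the order is again assumed to be length first, then alphabetical.

```python
from itertools import product
from typing import Iterable, List, Tuple


def order_key(s: str) -> Tuple[int, str]:
    """Sample order: shorter strings first, ties broken alphabetically."""
    return (len(s), s)


def retest_strings(new_links: Iterable[Tuple[str, str]], w: str,
                   alphabet: str) -> List[str]:
    """Strings minword(q) + sigma + z that precede w, for each new link.

    new_links holds (minword(q), sigma) pairs for which a new mutable
    link from q under sigma was just added by insert-state.
    """
    found = set()
    for minword_q, sigma in new_links:
        prefix = minword_q + sigma
        # A tail z can only help while the whole string is no longer than w.
        for tail_len in range(len(w) - len(prefix) + 1):
            for tail in product(sorted(alphabet), repeat=tail_len):
                y = prefix + "".join(tail)
                if order_key(y) < order_key(w):
                    found.add(y)
    return sorted(found, key=order_key)
```

When the prefix minword(q)σ is already longer than w, the inner range is empty and nothing is retested for that link, which matches the remark that some states need no string at all.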

As stated in the introduction, one of the major goals that motivated this work was
to design an algorithm for learning an FSA that will require only a modest amount
of memory, much less than the sample size which is exponential in m. Clearly, the
storage needed for the current guess is proportional to n² × a, and the storage needed
for all the values minword(q), for each state q, is proportional to n. The easiest way
to envision the learning algorithm is to imagine that it uses two separate
representations, one for new-M and the other for old-M. Taken literally, this
would double the size of the storage needed for the current guess. A more efficient
solution is to indicate some links on the current guess as old ones, and thus analyze
both machines on "one" representation. Another improvement might be gained
(with respect to the amount of storage needed for the algorithm) by modifying the
insertion process so as to avoid the need for storing minword(q) for each q. These
values might as well be computed for every new state by linearly traversing the
prefix-coded tree of the permanent links.

Let |Q_L| = n_L. As shown above, the current size n of the machine is at most n_L.
Any two distinct states in M_L are obviously separable. Moreover, it can be shown
that the shortest string that distinguishes between such a pair of states is of
length at most n_L. Since each minword(q) is of length at most n_L, we can conclude
that the maximum string after which the current guess is isomorphic to M_L is of
length linear in n_L. In order to prove this, consider a guess M_x for which
|Q_x| = n_L and |x| > 2 × n_L + 1, and suppose (by way of contradiction) that M_x is
not isomorphic to M_L. In other words M_x is non-deterministic, hence there exists a
state p such that there are at least two distinct mutable links from p under some
letter φ. Let t and r be those states having incoming mutable links from p under φ.
As indicated above, |minword(p)| ≤ n_L, and there is a string of length at most n_L
that separates t and r. Hence, M_x admits two different paths for a word of length at
most 2 × n_L + 1, one that leads to an accepting state, and the other to a rejecting
one. This obviously contradicts the consistency assumption of M_x. In summary, our
algorithm could get by with space proportional to a × n_L² to store its guess plus m
for the current string. This corresponds to the abstract notion of iterative learning
of [Wi 76].

4.  Iterative  learning  using  finite  working  storage 

In this section we will formally characterize some "practical" properties of the
learning  algorithm  introduced  in  Section  3.  Taking  into  account  the  limitation  of 
space  in  all  realistic  computations,  the  most  important  property  of  our  algorithm  is 
that  for  every  sample,  given  in  a  lexicographic  order,  the  algorithm  uses  a  finite 
amount  of  space.  The  restriction  on  the  strict  order  of  the  sample  may  seem  to  be  too 
severe.  We  will  show  in  this  section  that  this  restriction  is  necessary  for  learning 
with  finite  memory. 

Our  first  definition  follows  the  approach  of  [Wi  76,  JB  81,  OSW  86]: 

Definition: An algorithm IT (iteratively) identifies a set of languages S iff for every
L ∈ S, given the complete lexicographic sample <±λ, ±succ(λ), ±succ(succ(λ)), ...>,
the algorithm defines a sequence of finite machines <M_0, M_1, M_2, ...>, so that
∀i ≥ 1, M_i is obtained from the pair (M_{i-1}, ±x_i), where x_i is the i-th string in
the sample, and ∃j so that ∀k ≥ j, M_k = M_j and L(M_j) (the language accepted by the
machine M_j) is L.

It  is  easy  to  see  that  the  algorithm  of  Section  3  meets  the  requirements  of  the  IT 
definition.  In  other  words,  we  exhibit  an  algorithm  that  IT  identifies  the  FSL. 
Moreover,  it  finds  a  minimum-state  DFSA  for  the  inferred  language  L. 
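
The shape of an iterative learner can be sketched in a few lines. This is not the algorithm of Section 3: `complete_sample`, `run_iterative` and the `update` argument are hypothetical names, and the toy update used in the test only counts positive examples. The point is that each guess is computed from the previous guess and one labelled string, with no stored examples.

```python
from itertools import count, islice, product
from typing import Callable, Iterator, List, Tuple, TypeVar

Guess = TypeVar("Guess")


def complete_sample(alphabet: str,
                    member: Callable[[str], bool]) -> Iterator[Tuple[str, bool]]:
    """Yield (x, label) for every string, shortest first, then alphabetically."""
    for length in count(0):
        for letters in product(sorted(alphabet), repeat=length):
            x = "".join(letters)
            yield x, member(x)


def run_iterative(update: Callable[[Guess, str, bool], Guess], initial: Guess,
                  sample: Iterator[Tuple[str, bool]], steps: int) -> List[Guess]:
    """Compute M_0, M_1, ..., M_steps, each from the previous guess and one string."""
    guess = initial
    history = [guess]
    for x, label in islice(sample, steps):
        guess = update(guess, x, label)  # M_i depends only on (M_{i-1}, ±x_i)
        history.append(guess)
    return history
```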

We  now  define  a  weaker  characterization  of  a  learning  algorithm.  We  allow 
finite  working  storage  in  addition  to  that  required  for  defining  the  current  guess  and 
sample  string.  The  ultimate  goal  will  be  to  show  that  the  restriction  on  the  order  of 
the  sample  is  necessary  even  for  this  kind  of  algorithm. 

Definition: An algorithm FS-IT (iterative algorithm that uses finite storage)
identifies a set of languages S iff for every L ∈ S, given the complete lexicographic
sample <±λ, ±succ(λ), ±succ(succ(λ)), ...>, there exists a finite set of states Q, such
that the algorithm defines a sequence of configurations <(M_0, q_0), (M_1, q_1),
(M_2, q_2), ...> that satisfies the following: ∀i, q_i ∈ Q; ∀i ≥ 1, (M_i, q_i) is
obtained from the triple (M_{i-1}, q_{i-1}, ±x_i), where x_i is the i-th string in the
sample; and ∃j such that ∀k ≥ j, M_k = M_j and L(M_j) = L. Such a j (in the course of
learning) is referred to as a "semi-stabilization" point. Note that the states within
the configurations after the semi-stabilization point can still change.

Obviously,  if  an  algorithm  IT  identifies  S,  then  it  also  FS-IT  identifies  this  set. 

A (non-repetitive) complete sample for a language L is a sequence of its strings
<±x_1, ±x_2, ±x_3, ...> so that ∀i, x_i ∈ L; ∀i ≠ j, x_i ≠ x_j; and ∀x ∈ L ∃i so that
x = x_i. The ability to learn languages by presenting an arbitrary complete sample,
rather than the strict lexicographic one, obviously strengthens the characterization
of the learner. We denote the above situations by ITarb and FS-ITarb if we do not
require the sample to be in lexicographic order.

For a complete sample <±x_1, ±x_2, ±x_3, ...>, an algorithm that FS-ITarb
identifies a set of languages S uses a finite set of states Q, and defines a sequence
<(M_0, q_0), (M_1, q_1), ...>. M_i and q_i ∈ Q are referred to as the current guess and
the current state after the finite prefix <±x_1, ..., ±x_i>, ∀i ≥ 0.

Theorem. There is no algorithm that FS-ITarb identifies the finite state languages.

Proof.  Suppose,  to  the  contrary,  that  an  algorithm  A  FS-ITarb  identifies  the  FSLs. 
We  will  look  for  some  FSLs  that  will  lead  to  a  contradiction. 

Let L_0 be Σ*, for some alphabet Σ.

For the lexicographically ordered sample for L_0, A defines the sequence
<(M_0^0, q_0^0), (M_1^0, q_1^0), ...>. By the definition of FS-ITarb and in particular
due to the finiteness of the working storage, there exists some semi-stabilization
point, i, that corresponds to some word x, such that the following two conditions are
satisfied:

1) The current guess M_m^0, ∀m ≥ i, is the same as M_i^0 and characterizes L_0. Call
this guess M_{L_0}.

2) There are infinitely many m's, m ≥ i, such that q_m^0 = q_i^0 (i.e. the state q_i^0
occurs infinitely often).

Let L_1 = {w ∈ Σ* | w ≤ x}.

For the lexicographically ordered sample for L_1, <+λ, ..., +x, -succ(x), ...>, A
defines the sequence <(M_0^1, q_0^1), (M_1^1, q_1^1), ..., (M_i^1, q_i^1),
(M_{i+1}^1, q_{i+1}^1), ...>. Obviously, ∀m, 0 ≤ m ≤ i, M_m^0 = M_m^1 and
q_m^0 = q_m^1. In particular M_i^1 = M_{L_0} and q_i^1 = q_i^0. By the infinitely
repeating property of q_i^0, and by the finiteness condition on the set
{q_m^1 | m ≥ 0}, q_i^0 must coincide with some q_m^1 infinitely often. In other words,
there exists a j that corresponds to some word z, j > i, so that the following three
conditions are satisfied:


1) q_j^0 = q_i^0.

2) ∀m ≥ j, M_m^1 = M_j^1 and characterizes L_1. Call this guess M_{L_1}.

3) There are infinitely many m's, m > j, such that q_m^0 = q_i^0 and q_m^1 = q_j^1 (the
pair of states q_j^0, q_j^1 appears infinitely many times at the same points for the
strict ordered samples for L_0 and L_1).

Pick a place k, k > j, so that q_k^0 = q_i^0 and q_k^1 = q_j^1. The existence of such a
place is established by the properties of the chosen j. Let y be the string at place k
in the lexicographically ordered sample of any language over Σ. Note that y > z.

Let L_2 = L_1 ∪ {w ∈ Σ* | z < w ≤ y}.

For the following ordered sample <+λ, ..., +x, +succ(z), ..., +y, -succ(x), ..., -z,
-succ(y), -succ(succ(y)), ...>, A defines the sequence <(M_0^2, q_0^2),
(M_1^2, q_1^2), ...>. From the above we get (see Figure 11):

(1) After the finite prefix <+λ, ..., +x> the current guess is M_{L_0} and the state
is q_i^0.

(2) By the definition of z and y, after the finite prefix <+λ, ..., +x, +succ(z),
..., +y> the current guess is still M_{L_0}, and the state is again q_i^0.

(3) By the definition of z with respect to its occurrence in the strict ordered
sample for L_1, after the finite prefix <+λ, ..., +x, +succ(z), ..., +y, -succ(x),
..., -z> the current guess M_k^2 is M_{L_1}, and q_k^2 = q_j^1.

(4) By the property of q_j^1, ∀m > k, M_m^2 = M_{L_1}, and q_m^2 = q_m^1.

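Hence, on this complete sample for L_2, the guesses of A stabilize at M_{L_1}, which
characterizes L_1 rather than L_2. Therefore A cannot FS-ITarb identify L_2.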


□ 
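
The reordered presentation used for L_2 in the proof can be made concrete for small strings. The sketch below is ours and purely illustrative: `strings_between` and `l2_prefix` are hypothetical helpers, the order is assumed to be length first and then alphabetical, and x, z, y are chosen arbitrarily rather than derived from any actual learner.

```python
from itertools import product
from typing import List, Tuple


def order_key(s: str) -> Tuple[int, str]:
    """Sample order: shorter strings first, ties broken alphabetically."""
    return (len(s), s)


def strings_between(alphabet: str, lo: str, hi: str) -> List[str]:
    """All strings s with lo <= s <= hi, listed in sample order."""
    out = []
    for length in range(len(lo), len(hi) + 1):
        for letters in product(sorted(alphabet), repeat=length):
            s = "".join(letters)
            if order_key(lo) <= order_key(s) <= order_key(hi):
                out.append(s)
    return out


def l2_prefix(alphabet: str, x: str, z: str, y: str) -> List[Tuple[str, bool]]:
    """Labelled prefix <+λ..+x, +succ(z)..+y, -succ(x)..-z> of the L_2 sample."""
    assert order_key(x) < order_key(z) < order_key(y)
    up_to_x = strings_between(alphabet, "", x)        # λ, ..., x
    after_x = strings_between(alphabet, x, z)[1:]     # succ(x), ..., z
    after_z = strings_between(alphabet, z, y)[1:]     # succ(z), ..., y
    return ([(s, True) for s in up_to_x]
            + [(s, True) for s in after_z]
            + [(s, False) for s in after_x])
```

Every string up to y appears exactly once in the prefix, so continuing with -succ(y), -succ(succ(y)), ... yields a complete non-repetitive sample for L_2 that is simply not in lexicographic order.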

Thus we have shown formally that the FSL can be IT-identified (from
lexicographically ordered samples) but cannot be ITarb-identified. The theorems
provide end-case results, but there is a wide range of possible presentation
disciplines between IT and ITarb. Obviously enough, our algorithm will still identify
the FSL from presentations in which redundant strings happen to be missing. That
is, any w such that w = succ(x) and M_w = M_x could be missing from the sample
without effect. As stated above, for any FSL for which |Q_L| = n, namely an n-FSL,
all the strings longer than 2 × n + 1 will be redundant. Moreover, most strings of
length at most 2 × n + 1 could be missing from the sample. This follows because
there are only a × n² links to be added or deleted and more than a^(2×n+1) strings of
length ≤ 2 × n + 1. Therefore a teacher could, in principle, get by with a greatly
reduced presentation if she knew what to present. The same reduction could also be
used in the retest phase of our algorithm.
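
The counting argument above is easy to check numerically. The helper names below are ours, and the values of a and n in the usage are arbitrary examples.

```python
def link_budget(a: int, n: int) -> int:
    """Upper bound a * n^2 on the links an n-state guess over a letters can hold."""
    return a * n * n


def strings_up_to(a: int, length: int) -> int:
    """Number of strings of length at most `length` over an alphabet of size a."""
    return sum(a ** k for k in range(length + 1))


# Example: two letters, five states.
a, n = 2, 5
links = link_budget(a, n)                  # 50 candidate links
strings = strings_up_to(a, 2 * n + 1)      # 4095 strings of length <= 11
```

Even in this small case the strings of length at most 2 × n + 1 outnumber the candidate links by roughly two orders of magnitude, which is why most of them can carry no new information.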

[Figure 11: the guesses and states of A along the samples for L_0, L_1, and L_2.]

We  conjectured  that  perhaps  for  an  ordered  presentation  of  some  particular 
subset,  followed  by  an  arbitrary  sequence  of  the  remaining  strings,  our  algorithm 
would  be  still  applicable. 

Let S_L be the set {minword(q) | q ∈ Q_L}, and let y be the maximal string in this set.
We easily found a counterexample that shows that the subset {x ∈ Σ* | x ≤ y} does not
suffice, and it is not even the case that Q_y, the set of states in the last guess
M_y, is the same as Q_L. We then examined a larger set. Let S_L' be the set
{x | (∃ p,q ∈ Q_L) (x = minword(p)z and x ≠_L minword(q)z and (∀z' ∈ Σ*)
(minword(p)z' ≠_L minword(q)z' ⇒ x ≤ minword(p)z'))}. The intuition behind this set is
to include all the least strings that can distinguish between two distinct states in
M_L. First we observe that S_L ⊆ S_L'. Consider some state p in Q_L. There must be
some q in Q_L the parity of which differs from that of p, hence
minword(p) ≠_L minword(q). Obviously, for every z' ∈ Σ*, minword(p) ≤ minword(p)z'.
Hence, minword(p) ∈ S_L'. Now, let y' be the maximal word in S_L'. The following
counterexample shows that even the subset {x ∈ Σ* | x ≤ y'} does not suffice. Let L be
the language (given as an example in Section 3): "the number of a's is at most 1, and
the number of b's is at least 1." S_L' = {λ, a, b, aa, ab, ba, aab, aba}. After the
finite prefix <-λ, -a, +b, ..., -aba>, M_aba is as shown in Figure 12.


Figure  12 

Obviously, M_aba ≠ M_L. Moreover, let -bbaba be the next sample string. The rule
for breaking all wrong paths by deleting the first mutable links along them cannot
be applied in this case. The first mutable link along the bad path <q0 b qi, qi b qi,
qi a qi, qi b qi, qi a qi> is the link <qi b qi> that should remain in M_L. In fact,
every other possible rule that determines the mutable link to be deleted, according to
its place in the bad path, would not work here. For every i = 1, 2, 3, 4, there is a
bad path whose i-th mutable link exists in M_L.
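
For readers who want to experiment with this example language, the sketch below encodes a five-state deterministic machine for "at most one a and at least one b" and recovers the set S_L by breadth-first search. The state encoding and the function names are our own; the figure's machine is not reproduced here.

```python
from collections import deque
from typing import Dict, Tuple

State = Tuple[int, bool]  # (number of a's seen, capped at 2; whether a b was seen)
DEAD: State = (2, False)  # two or more a's: no continuation can be accepted


def step(state: State, ch: str) -> State:
    a_count, has_b = state
    if ch == "a":
        a_count = min(a_count + 1, 2)
    else:
        has_b = True
    return DEAD if a_count == 2 else (a_count, has_b)


def accepts(w: str) -> bool:
    state: State = (0, False)
    for ch in w:
        state = step(state, ch)
    return state[0] <= 1 and state[1]


def minwords() -> Dict[State, str]:
    """Least access string of every state (breadth-first, letters in order)."""
    start: State = (0, False)
    found = {start: ""}
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for ch in "ab":
            nxt = step(q, ch)
            if nxt not in found:
                found[nxt] = found[q] + ch
                queue.append(nxt)
    return found
```

With this encoding the minimal strings come out as λ, a, b, aa and ab, all of which appear in the set S_L' listed above, and the string bbaba is rejected because it contains two a's.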

It  is  an  open  question  whether  any  characterization  of  the  minimum  training  set 
exists. 

It  is  also  easy  to  see  intuitively  why  finite  storage  learners  will  fail  on  arbitrary 
presentations.  An  arbitrary  presentation  can,  for  example,  have  only  very  long 
strings  for  a  very  long  time  and  the  learner  has  no  idea  what  to  make  of  them.  This 
is  the  basic  cause  of  the  NP-completeness  results  of  [Go  78]  and  [An  78]  for  minimal 
DFSA  learning.  On  the  other  hand,  if  the  learning  device  knew  in  advance  the  size, 
n,  of  the  n-FSL,  it  might  be  able  to  collapse  the  long  sample  strings  into  equivalence 
classes. This is another open question. The realistic version of this is for the learner,
which  has  finite  storage,  bounded  by  some  polynomial  in  n,  to  limit  its  guesses  to  M 
with  no  more  than  n  states. 

There  do  seem  to  be  some  general  consequences  of  the  outcome  of  these  open 
questions.  If,  as  we  surmise,  knowing  a  bound  on  n  for  the  target  n-FSL  does  not 
permit FS-ITarb with the finite storage being bounded by a polynomial in n, then
learning  simple  examples  first  has  inherent  major  advantages.  If  there  are  optimal 
training  presentations,  it  will  be  interesting  to  understand  their  nature.  As  we  will 
show  in  the  next  section,  the  algorithm  of  Section  3a  works  in  a  way  that  is 
compatible  with  connectionist  and  thus  (at  least  for  some  people)  with  neural 
computation. 

5.  Distributed  and  connectionist  versions 
5a.  Distributed  realization 

As  we  discussed  in  the  introduction,  there  is  no  generally  accepted 
formalization  of  what  precisely  constitutes  a  connectionist  model.  In  this  section  we 
show  how  the  algorithm  of  Section  3b  can  be  translated  into  a  network  of  simple 
computing  units  that  falls  within  the  range  of  connectionist  models.  In  particular, 
the  network  involves  only  simple  units  that  broadcast  very  simple  outputs  on  all 
their  outgoing  links.  Learning  is  realized  by  local  weight-change  that  restructures 
the  network.  There  is,  of  course,  no  interpreter,  but  there  is  some  central  control. 
There  are  several  places  in  the  construction  where  system-wide  parallelism  holds, 
but  there  are  also  sequential  aspects  that  seem  to  be  inherent.  For  any  finite  system 
to  recognize  unbounded  inputs,  it  will  have  to  look  at  pieces  of  the  input 
sequentially.  Also  in  the  retest  phase  of  our  algorithm,  it  is  necessary  to  test 
individual samples sequentially.

The  conceptual  distance  from  the  algorithms  and  analysis  above  to  the  sketched 
connectionist  version  is  considerable  and  we  will  traverse  it  in  two  steps.  We  will 
start  with  a  realization  in  terms  of  a  module-message  model,  like  PLITS  or  CSP. 
Each  state  of  the  target  FSA  will  be  represented  by  a  module  and  there  will  be  a 
control  module  and  several  other  fixed  modules. 


Each state-module, q, will have data structures for its activation state, its parity,
its minimal string minword(q), and its outgoing links to other states and other
modules. We suppose that the system is synchronous and that the control module
broadcasts each letter of the input string, w, at the start of a major cycle. The first
benefit of the parallel implementation is that all paths for the target string can be
checked in parallel. Initially, q_0 is active. Each active module looks at the next
symbol, σ, and sends an "activation" message along each of its outgoing links that
correspond to σ. When these signals have been sent, any state that has not received
new activation inactivates itself, and the states that have received such signals are
active for the next cycle. There are three kinds of activation messages that are sent
by states along paths of w. Recall the notation p and φ for the tested state and tested
letter of w on the current machine. The first kind of activation message is sent by
those states along the basic path for minword(p). This message indicates that no
mutable link has yet occurred. Let p be the first state to use a mutable outgoing
link; p sends a message of the second kind, which encodes its identity, p, and that of
the tested letter, φ. Each active module q that receives a message of the second kind,
encoding the pair (p, φ), looks at the next symbol σ, and sends an activation message
of the third kind along each of its outgoing links that correspond to σ. This message
encodes the triple (p φ q), indicating a specific first mutable link along a path for
w. All successor states that receive this kind of message send that same message. Note
that each path is represented by its first mutable link, encoded within the message
that passes through the corresponding suffix.

At the end of the string, marked by a terminator ⊣ or by a control signal, the
states  that  are  active  report  their  corresponding  first  mutable  links  plus  their  parity 
±.  This  could  be  reported  directly  to  the  controller  or  (more  connectionist)  by 
sending  activation  to  global  variables  (or  modules)  that  represent  good  and  bad 
strings.  The  control  now  compares  the  provided  answer  and  if  all  reports  are  right,  it 
goes  on  to  the  next  string,  as  before. 
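
The parallel test of all paths can be simulated sequentially. The sketch below is a simplification under our own assumptions: the dictionary layout and the name `run_paths` are hypothetical, and the second and third kinds of message are merged, so a path is tagged with the full triple (p, letter, q) as soon as it crosses its first mutable link.

```python
from typing import Dict, List, Optional, Set, Tuple

Link = Tuple[str, bool]                      # (target state, is_mutable)
Tag = Optional[Tuple[str, str, str]]         # first mutable link (p, letter, q)


def run_paths(links: Dict[Tuple[str, str], List[Link]], parity: Dict[str, bool],
              start: str, w: str) -> Set[Tuple[Tag, bool]]:
    """Follow every path for w; report (first mutable link, parity) per path end."""
    active: Set[Tuple[str, Tag]] = {(start, None)}
    for ch in w:                             # one major cycle per input letter
        nxt: Set[Tuple[str, Tag]] = set()
        for state, tag in active:
            for target, mutable in links.get((state, ch), []):
                new_tag = tag
                if tag is None and mutable:  # this is the first mutable link
                    new_tag = (state, ch, target)
                nxt.add((target, new_tag))
        active = nxt                         # states without new activation drop out
    return {(tag, parity[state]) for state, tag in active}
```

A report with tag `None` corresponds to a path that used permanent links only, i.e. the first kind of message was never replaced.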

If  some  parses  are  right  and  some  are  wrong,  a  deletion  process  corresponding  to 
the  subroutine  delete-bad-paths  must  be  executed.  As  mentioned  and  proved  in 
Section  3,  each  of  the  first  mutable  links  along  paths  for  w  corresponds  either  to  a  set 
of  right  paths,  or  to  a  set  of  wrong  paths.  The  controller  can  obviously  identify  all 
paths  by  their  first  mutable  links,  and  knows  which  ones  were  right  or  wrong  from 
the reported parity. It then composes a message to all the "bad" mutable links, and
sends  this  information.  Each  module  that  is  the  origin  of  a  bad  mutable  link  will 
then  delete  the  corresponding  outgoing  link,  thus  breaking  a  bad  path  for  w. 
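
The controller's deletion step can be sketched as follows. The report format, a set of (first mutable link, parity) pairs, and the name `delete_bad_links` are assumptions made for this illustration only.

```python
from typing import Dict, List, Optional, Set, Tuple

Link = Tuple[str, bool]                      # (target state, is_mutable)
Tag = Optional[Tuple[str, str, str]]         # first mutable link (p, letter, q)


def delete_bad_links(links: Dict[Tuple[str, str], List[Link]],
                     reports: Set[Tuple[Tag, bool]],
                     label: bool) -> Set[Tuple[str, str, str]]:
    """Remove every first mutable link whose paths reported the wrong parity.

    `label` is the correct answer for the tested string. Returns the deleted
    links; `links` is modified in place, as each origin module would do.
    """
    bad = {tag for tag, par in reports if tag is not None and par != label}
    for src, ch, dst in bad:
        links[(src, ch)] = [(target, mutable)
                            for target, mutable in links[(src, ch)]
                            if not (mutable and target == dst)]
    return bad
```

Only mutable links are ever removed; permanent links with the same target are left in place.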

Finally, we need to model the state-addition and retest procedures. Clearly the
control can easily discover if all paths are wrong. A new state is "recruited" by
generating a new module, s, and initializing it with its parity, its path minword(s),
and its permanent link. All this information can be easily generated using the
data structure within the module that corresponds to the tested state p, namely the
one that identifies itself as the source of all the bad mutable links to be deleted at
that stage. With this information, the new state/module can establish links with old
states following the strictures of insert-state. Again, this can be done in parallel
except for the serialization within module s itself. The retest procedure is
sequential: the controller cycles through the required strings and tests them against
the old machine. The differences between the links of the old and new machines are
also part of the data structures of the appropriate modules. Of course, within each
string test, the parallel checking and deletion above still apply. Much of this will
carry over to the connectionist version, but there are also several differences.

5b.  Connectionist  realization 

Connectionist  models  in  the  literature  vary  somewhat  [RM  86,  WF  87]  but  all  are 
restricted  to  simple  units  that  pass  only  numerical  messages  and  always  send  the 
same  number  on  each  outgoing  link.  The  links  may  have  weights  that  modify  the 
value being received and many models also allow conjunctive connections like those we
used in Figure 2. Rochester practice allows for a unit to have a small amount of
internal data and to be in one of a small number of different "states," which we will
denote here as "modes" to reduce confusion. The limited repertoire of connectionist
systems  forces  the  use  of  more  elaborate  structures  than  the  previous  version.  We 
will  present  an  outline  of  one  such  model. 

The  connectionist  version  of  the  FSA  learner  will  have  a  control  subnetwork  that 
will sequence and modulate the basic learning net. There will be "registers," banks
of  units  whose  activation  pattern  represents  a  letter  or  a  string  of  letters,  like  the  top 
row  of  Figure  2.  The  basic  process  of  testing  a  string  against  the  current  guess  works 
as  outlined  in  Figure  2.  Each  letter  of  the  input  string  serves  in  turn  as  the  gate  on 
the  conjunctive  connections  from  state-unit  to  state-unit.  The  conjunction  of 
activation  of  the  prior  state  and  the  appropriate  letter-unit  leads  to  activation  of  the 
next  state-units  along  appropriate  links.  One  way  to  have  the  state-units  turn  off 
when  they  should  is  to  have  a  just-sent  mode.  A  unit  in  just-sent  mode  will 
inactivate itself (set its activation flag to zero) if it receives only a control signal for
the  next  cycle. 

A  somewhat  similar  mechanism  can  be  used  to  mark  the  first  mutable  link  along 
each path. Suppose that permanent links have weight 1 and mutable links weight ½.
Let  the  activation  rule  for  a  unit  be  as  follows:  The  initial  state  qo  always  sends 
activation  value  10  to  start  each  string.  If  a  subsequent  state-unit  gets  input  10,  it 
also  sends  10  because  this  was  a  permanent  link.  When  a  state-unit  sees  an  input 
value  of  5,  it  knows  that  it  is  at  the  far  end  of  the  first  mutable  link  in  a  path,  and 
will  record  which  input  link  was  active.  It  will  also  send  out  a  lower  value,  say  4. 
Units  that  receive  either  2  or  4  will  also  send  out  4  as  a  value.  This  effectively  marks 
the  receiving  end  of  the  first  mutable  link  in  every  path,  with  a  tagged  input  in  the 
unit  for  that  state  marking  the  path. 
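
The numeric marking scheme can be traced along a single path. The sketch assumes that a unit receives the sent value multiplied by the link weight and that mutable links have weight ½, the value consistent with the inputs 5 and 2 mentioned above; `output_for` and `trace` are hypothetical names, and a real unit would of course combine several inputs.

```python
from typing import List, Optional, Tuple

PERMANENT = 1.0
MUTABLE = 0.5   # assumption: the weight that turns 10 into 5 and 4 into 2


def output_for(received: float) -> float:
    """Activation rule: 10 is passed on; 5, 4 and 2 all produce 4."""
    if received == 10:
        return 10.0
    if received in (5, 4, 2):
        return 4.0
    raise ValueError(f"unexpected input value {received}")


def trace(path_weights: List[float]) -> Tuple[Optional[int], List[float]]:
    """Follow one path; return the index of its first mutable link and the inputs."""
    sent = 10.0                  # q0 always starts a string with value 10
    first_mutable = None
    received_log = []
    for index, weight in enumerate(path_weights):
        received = sent * weight
        received_log.append(received)
        if received == 5:        # only the far end of the first mutable link sees 5
            first_mutable = index
        sent = output_for(received)
    return first_mutable, received_log
```

Along a path with weights 1, ½, 1, ½ the inputs are 10, 5, 4 and 2, so only the second link is tagged even though the fourth is also mutable.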

Upon  termination  of  the  testing  for  some  string,  w,  the  global  +  and  -  units  are 
compared  by  control  with  the  answer  provided  (as  activation  of  another  Winner- 
Take-All  pair).  If  all  paths  are  right,  then  the  next  string  is  tried.  If  there  are  both 
right  and  wrong  paths,  the  deletion  process  must  occur.  There  is  no  obvious  way  to 
do  this  in  parallel,  but  the  following  sequential  scheme  works.  Assume  that  the 
mechanism includes a "buffer" that can record the input string w and another buffer
that  can  be  made  to  cycle  through  strings.  Control  recycles  the  input  string  (in 
delete  mode)  until  the  unit  having  the  first  mutable  links  is  encountered.  That  is,  a 
state-unit  that  is  activated  in  delete  mode  and  has  its  tag  set  sends  a  different  signal 
which  is  detected  by  the  control  net.  Then  each  such  state  is  tested  sequentially  and 
the  ones  leading  to  wrong  answers  delete  their  corresponding  incoming  mutable 
link.  Deletion  can  be  just  setting  the  weight  to  zero.  There  needs  to  be  some 
mechanism  for  sequencing  these  states,  e.g.,  enabling  each  state  in  sequential  order. 

The insertion process for the connectionist version involves even more technical
details. It is reasonable to assume that the learning net has unused state-units that
are connected to all the ones used thus far; one of these is "recruited" to be the new
state s. It is not hard to determine which link to s should be permanent; it is the one
from the state p with the first mutable links in the (wrong) parses of w =
minword(p)φz. The string w could be reparsed and the output of 5 from the states
that receive conjoined signals from p and φ could inform the new state, s, that it
should set the weight of its active input link to be 1 (permanent). It is also
reasonable to assume that the state-units can mark the current links from p under φ
as "old," and thus to be used only in retest mode.

What we cannot assume is that state-units, q, can store the minimal string
minword(q) and compare it with minword(s) to determine which (q σ s) and (s σ q)
links should be added. Again, the apparent answer is to go sequential. We can
assume that the control net has buffers for minword(s) and minword(q); minword(s)
is fixed for the addition process, but minword(q) cycles through all the other existing
states. The control net finds minword(q) by testing strings, and with this in the
buffer, the tests minword(q)σ > minword(s) and minword(s)σ > minword(q) can be
carried out by the control net. The signal to break the appropriate links can be
transmitted to the state-units involved. This leaves just the retesting process. The
obvious way to handle this is to have the control net cycle through every string y <
w and test and correct the current guess. The basic process of testing a string and the
deletion process work as before. Each unit needs to have an "old machine" and "new
machine" mode and to know which links go with each. Each string less than w is
tested in old machine mode and the answer is stored. Then the same string is tested
in new machine mode and the deletion process is invoked for all wrong paths.

In  this  design,  each  unit  would  need  internal  data  for  recording  its  number, 
whether  it  is  active,  has  an  active  first  mutable  link  and  is  performing  as  the  old  or 
new machine. It would need "modes" for just-sent, for normal testing, for deletion,
for  recruiting  and  for  pruning  links.  If  we  restrict  ourselves  to  just  state-less  linear 
threshold  elements,  the  complexity  expands  by  an  order  of  magnitude. 

Obviously  enough,  a  fully  worked  out  connectionist  realization  would  be  quite 
complex  and  would  be  one  of  the  most  elaborate  models  yet  built.  Connectionist 
models  are  at  a  very  low  conceptual  level  and  this  always  leads  to  complications  in 
large  problems.  On  the  other  hand,  the  construction  outlined  above  does  no  great 
violence  to  connectionist  principles  and  could  be  turned  into  a  connectionist  FSA 
learning  machine.  This  would  be  much  more  general  than  any  existing 
connectionist  learning  network.  Since  the  details  of  each  parse  play  an  important 
part  in  the  learning  procedure,  there  are  at  least  indirect  connections  with 
explanation-based  learning.  But  the  case  of  learning  from  a  perfect, 
lexicographically  ordered  sample  is  a  very  special  one.  It  is  well  worth  exploring 
how  the  algorithms  of  this  paper  could  be  modified  to  deal  with  less  controlled 
examples.  An  obvious  approach  is  to  change  the  delete  process  (of  mutable  links)  to 
one that reduces weights to some non-zero value, perhaps halving them each time.
The  question  of  how  to  revise  the  rest  of  the  network’s  operation  to  properly  treat 
conflicting  evidence  is  another  topic  worthy  of  further  effort. 
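
As a purely illustrative sketch of the weight-reduction idea, and not a worked-out proposal, deletion could be replaced by a multiplicative penalty. The function `penalize`, the dictionary layout, and the factor ½ are all assumptions of ours.

```python
from typing import Dict, Set, Tuple

LinkId = Tuple[str, str, str]   # (source state, letter, target state)


def penalize(weights: Dict[LinkId, float], bad_links: Set[LinkId],
             factor: float = 0.5) -> Dict[LinkId, float]:
    """Return new weights in which each bad mutable link is scaled by `factor`."""
    return {link: (weight * factor if link in bad_links else weight)
            for link, weight in weights.items()}
```

A link blamed repeatedly would decay geometrically towards zero without ever being removed outright, which leaves room for later evidence to be weighed against it.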